SECTION 1
INTRODUCTION

1.1  GENERAL

The topic of distributed systems is a major research area at the Georgia
Institute of Technology. The particular class of systems being examined,
Fully Distributed Processing Systems (FDPS), is described in a definitional
paper by Enslow [Ensl78]. In recent years, the phrase "distributed systems"
has become an extremely popular term for both research and marketing; thus the
meaning of the term has become very imprecise. For that reason, we at Georgia
Tech have further identified our particular area of interest as "fully"
distributed processing systems (FDPS). The major factor differentiating our
work from that of others is that we assume a network of very loosely
coupled processors. This is an important distinction to bear in mind, since
it makes our view of distributed systems somewhat different from that of other
researchers.

Conceptually, an FDPS consists of a loosely-coupled network of independent
machines. Each machine is capable of communicating with other machines
and controls a set of local physical and logical resources (e.g., processors,
memory, files, and devices). The machines are autonomous in that each
processor or server has final responsibility for the control of the resources
it provides. A layer of control is imposed on this network of machines to
achieve unification of resources, cooperation, and system transparency. It is
assumed that all machines, while retaining their autonomy, follow a common
master plan to attain effective cooperation between the loosely-coupled
logical as well as physical resources.

The primary goal of the Georgia Tech Research Program in Fully
Distributed Processing Systems is to develop the technology necessary to
design, implement, and operate very loosely-coupled systems. Such systems
should be capable of operating in dynamic system configurations with a high
degree of cooperation in providing services requested from the system as a
whole. The various research issues that have been identified thus far include
such topics as distributed operating systems, programming languages,
theoretical and formal studies, distributed data bases, physical
interconnection and message transportation, fault tolerance, and security,
among others.

This particular report discusses system support capabilities (SSC's) to
support the activities of design, analysis, implementation, utilization, and
management control of fully-distributed/loosely-coupled processing systems.

1.2  PURPOSE OF THIS STUDY

It is accepted that there will be some data processing applications or
collections of applications for which some form of distributed processing is
the only reasonable design philosophy. This study does not address the basis,
rationale, or benefits and costs of such a decision. It is assumed that the
decision to distribute has already been made. It is recognized that the basis
for making such a decision has not yet been adequately studied. Some work on
that topic is being performed under the same major research program that also
includes this particular project, the Georgia Institute of Technology
Research Program in Fully Distributed Processing Systems. That work will not
be reported on here. However, earlier as well as other current work in the
FDPS Research Program has presented persuasive arguments as to the requirement
for very loose coupling, both logical and physical, in large-scale distributed
systems. In fact, extremely loose coupling is one of the fundamental system
concepts of Fully Distributed Processing Systems, and that feature is accepted
as a basic characteristic of the distributed systems considered as the target
for the work performed under this immediate study.

Loosely-coupled distributed systems will pass through the same life-cycle
phases as centralized systems and will require many of the same support
capabilities  as  such  systems.  However,  the  exact  nature  of  loosely-coupled 
distributed  systems  presents  additional  requirements  for  new  support 
capabilities  as  well  as  changes  or  extensions  to  the  support  that  would  be 
provided  for  the  analysis,  design,  implementation,  and  operation  of 
centralized  systems.  The  scope  of  this  study  covers  new  capabilities  as  well 
as  extensions  to  "existing"  ones. 

1.2.1  Extensive  Support  Capabilities  Are  Essential 

There are three major activities to be supported by the "capabilities"
being considered in this report:

• Designing a specific distributed system
• Producing the software to implement that system
• Operating that system

All three of these activities are greatly complicated by the basic
characteristics of "distribution" considered in its broadest sense.

• The presence of multiple execution environments, operating
  simultaneously with almost total asynchrony and often
  non-homogeneity, creates perhaps the greatest problems in all three
  activities.

• The problems facing software development are the same as those
  found in centralized uni-processor systems with the added
  complications of having multiple target environments as well as
  multiple development environments.

The only solution that appears viable is a well-developed and extensive
set of software and hardware support capabilities.

1.2.2  Scope  and  Outline  of  This  Project 

The original scope of work under this project consisted of three major
steps.

Step 1.  Investigate the need for system support capabilities

     a.  Identify capabilities required to support design, analysis,
         implementation, utilization, and management control of fully
         distributed loosely-coupled data processing systems.

     b.  Document the control problem associated with each activity.

     c.  Identify specific system support capabilities required to
         support each activity.

     d.  Categorize the system support capabilities identified into the
         following two categories

              I  - Most essential, urgently required
              II - Secondary importance

         and estimate resources and facilities required to implement
         each capability.

Step 2.  Implement and demonstrate those "essential" (i.e., Category I)
         support capabilities selected/specified by the government.

Step 3.  Implement and demonstrate those "secondary" (i.e., Category II)
         support capabilities selected/specified by the government.

Present plans do not contemplate the execution of Steps 2 and 3 at this
time, and the work plan has been modified accordingly. The primary activity
is now just the first step, the identification of the system support
capabilities required. The remaining resources will be utilized to perform the
preliminary design of some of the most essential capabilities.

1.3  THE LIFE CYCLE OF DISTRIBUTED SYSTEMS

One important aspect to consider in examining the requirements for support
capabilities is the specific environment in which a given support
capability is to be employed. For the purpose of this study, the application
environment will be identified by reference to a phase or a set of phases in
the overall life cycle of a distributed system.

In this study the various phases or activities of the life cycle are, in
chronological order:

• Problem Analysis and Functional Design
• Logical System Design
• Program Implementation
• Unit Test
• System Integration
• System Test
• Program Distribution and Installation
• System Operation and Utilization
• System Maintenance

Just  as  with  centralized  systems,  to  which  this  list  is  equally 
applicable,  there  are  a  number  of  feedback  paths  present  in  the  complete  life 
cycle. 

1.4  CATEGORIES OF SUPPORT CAPABILITIES AND THEIR APPLICATION

As  this  study  has  progressed,  a  large  number  of  different  system  support 
capabilities  applicable  to  the  total  life  cycle  of  distributed  systems  have 
been  identified.  As  the  list  expanded,  it  became  obvious  that  a  large  amount 
of  confusion  was  being  created  by  the  lack  of  a  clear  definition  of  the 
relationships  between  the  various  capabilities  and  their  specific 
applicability.  A  major  cause  of  this  confusion  was  the  absence  of  a  clear 
distinction between the major categories of support capabilities. In
addressing this particular problem, three major types of support capabilities
have been identified:

• Software Development Support Tools
• System Design Support Facilities
• Operational Support Capabilities

1.4.1  Software Development Support Tools

The primary purposes of software development support tools are the
production, maintenance, and management of the operational software systems,
both operating systems and applications programs, i.e., the production of
software. Some confusion is caused by the word "software" in the title. It
should be noted that "software" refers to what the tool or support capability
is applied to, not to the nature of the tool itself, since nearly all of the
support capabilities will be implemented in software, at least in part.

It is unfortunate that the designations "tool" and "support capability"
have been widely used almost totally interchangeably. (We have been as guilty
of this as anyone else.) Using the terms in this manner was one of the major
factors creating the confusion referred to above.

Because of the wide variety of support capabilities found within this
single category, further subcategories are useful in examining the
categorization and applicability of software support tools. The subcategories
identified thus far are:

• Software Requirements/Specification Tools
• Software Design Tools
• Software Implementation (Programming) Tools
• Software Quality Assurance Tools
• Software Maintenance Tools
• Software Cross-Environment Tools
• Miscellaneous Software Utility Tools
• Software Management Tools

It should be noted that these subcategories are equally applicable to
tools supporting centralized systems.

The list of subcategories given above will be utilized during this study
when such subdivisions are required; however, that is not the only set that
has been proposed. William Howden, in discussing software development
environments, presents a five-way categorization [Howden]:

• Requirements Tools and Methods
• Design Tools and Methods
• Coding Tools and Methods
• Verification Tools
• Management Tools and Techniques

Also, A.N. Habermann in [Riddle & Fairley] discusses his two-part
classification:

• Program Development Tools
• System Construction Tools

where examples of the first are the "classical tools such as compilers and
editors" while the latter "emphasizes the importance of specifications and
system version maintenance."

Another categorization methodology for software development tools has
been proposed by the Software Tools Project of the Institute for Computer
Sciences and Technology at the National Bureau of Standards. This methodology
is based on a multi-dimensional taxonomy of tool features describing the
characteristics of the input, the function, and the output of the tool. These
three major features are further divided into two or three dimensions. In all
there are 7 dimensions. In the list below the following notation is employed:

• Basic processes of a tool
     Classes of tool features - classification dimensions
          Specific tool features - multiple features in a single
          class may apply to a given tool.

• Input
     Subject (i.e., Main Input)
          Text
          VHLL (Very high level language)
          Code
          Data
     Control Input
          Commands
          Parameters


• Function
     Transformation (How is the subject manipulated)
          Editing
          Formatting
          Instrumentation
          Optimization
          Restructuring
          Translation
     Static Analysis (Operations on the subject)
          Auditing
          Comparison
          Complexity Measurement
          Completeness Checking
          Consistency Checking
          Cost Estimation
          Cross Reference
          Data Flow Analysis
          Error Checking
          Interface Analysis
          Management
          Resource Estimation
          Scanning
          Scheduling
          Statistical Analysis
          Structure Checking
          Tracking
          Type Analysis
          Units Analysis
     Dynamic Analysis (Operations during or after execution)
          Assertion Checking
          Constraint Evaluation
          Coverage Analysis
          Resource Utilization
          Simulation
          Symbolic Execution
          Timing
          Tracing
          Tuning

• Output
     User
          Computational Results
          Diagnostics
          Graphics
          Listings
          Text
          Tables
     Machine
          Data
          Intermediate Code
          Object Code
          Prompts
          Source Code
          Text

Although this classification methodology was developed primarily to
support, or force, complete descriptions of tools, it has also been useful in
the context of this study to check for completeness of coverage in our
consideration of the need for various software development tools.
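
As an illustration of how such a taxonomy supports both tool description and
a completeness check, the following Python sketch encodes the seven
dimensions as data. It is only a sketch under stated assumptions: the
identifiers (TAXONOMY, validate, uncovered, user_output, machine_output) and
the sample tool description are hypothetical, and the feature lists are
abbreviated rather than a full copy of the NBS lists.

```python
# Hypothetical encoding of the taxonomy: process -> dimension -> features.
# Feature lists are abbreviated for illustration.
TAXONOMY = {
    "input": {
        "subject": {"text", "vhll", "code", "data"},
        "control_input": {"commands", "parameters"},
    },
    "function": {
        "transformation": {"editing", "formatting", "instrumentation",
                           "optimization", "restructuring", "translation"},
        "static_analysis": {"auditing", "comparison", "cross_reference",
                            "data_flow_analysis", "error_checking"},
        "dynamic_analysis": {"coverage_analysis", "simulation", "timing",
                             "tracing", "tuning"},
    },
    "output": {
        "user_output": {"diagnostics", "graphics", "listings", "tables"},
        "machine_output": {"data", "intermediate_code", "object_code",
                           "source_code"},
    },
}

# Flat view: dimension name -> set of features defined for that dimension.
DIMENSIONS = {dim: feats
              for dims in TAXONOMY.values()
              for dim, feats in dims.items()}


def validate(description):
    """Return a list of problems in a tool description.

    A description maps dimension names to the set of features the tool has.
    """
    problems = []
    for dim, feats in description.items():
        if dim not in DIMENSIONS:
            problems.append(f"unknown dimension: {dim}")
            continue
        for feat in sorted(feats):
            if feat not in DIMENSIONS[dim]:
                problems.append(f"unknown feature {feat!r} in {dim}")
    return problems


def uncovered(descriptions):
    """Return, per dimension, the features that no described tool provides."""
    gaps = {}
    for dim, feats in DIMENSIONS.items():
        provided = set()
        for description in descriptions:
            provided |= description.get(dim, set())
        gaps[dim] = feats - provided
    return gaps


# Hypothetical description of a cross compiler.
cross_compiler = {
    "subject": {"code"},
    "control_input": {"parameters"},
    "transformation": {"translation", "optimization"},
    "user_output": {"diagnostics", "listings"},
    "machine_output": {"object_code"},
}
```

Calling uncovered() on the set of tools under consideration lists the
features that none of them provides, which is the kind of coverage check
described above.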


1.4.1.1  Examples of Software Development Support Tools

Implementation tools such as compilers and editors are certainly the
most common; however, there is beginning to be significant activity in the
development of support capabilities in the other categories as well. In the
initial edition of the "Software Engineering Automated Tools Index" published
by Software Research Associates the breakdown was as follows:

Category                              Number    Percentage

Requirements/Specification Tools        20          3%
Design Tools                            47          7%
Implementation Tools                   210         32%
Quality Assurance Tools                132         20%
Maintenance Tools                      119         18%
Project Management Tools                57          9%
Cross-Environment Tools                 16          2%
Miscellaneous Utility Systems           40          6%
Research and Development Systems         7          1%
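
The percentage column follows directly from the counts, which total 648
entries. The short Python sketch below recomputes each share rounded to the
nearest whole percent; the variable and function names are illustrative
only. On these counts it gives 18% for Maintenance Tools and 1% for Research
and Development Systems.

```python
# Tool counts per category, as listed in the breakdown above.
COUNTS = {
    "Requirements/Specification Tools": 20,
    "Design Tools": 47,
    "Implementation Tools": 210,
    "Quality Assurance Tools": 132,
    "Maintenance Tools": 119,
    "Project Management Tools": 57,
    "Cross-Environment Tools": 16,
    "Miscellaneous Utility Systems": 40,
    "Research and Development Systems": 7,
}


def percentages(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


SHARES = percentages(COUNTS)

if __name__ == "__main__":
    for name, share in SHARES.items():
        print(f"{name:<36}{COUNTS[name]:>5}{share:>6}%")
```

Because each share is rounded independently, the rounded percentages need
not sum to exactly 100.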

Examples  of  specific  tools  that  fall  in  each  subcategory  are  given  below. 

• Requirements/Specification Tools
     Requirement/Specification Languages
     Charts and Diagrams (both formal and informal)
          (e.g., HIPO, SADT, Dataflow, etc.)
     Specification Cross-Reference Analyzer
     Archiver/Retriever for Requirements Specifications

• Design Tools
     Formal Design Tools/Methodologies
          (e.g., PDL, Structured Design)
     Automated Data Dictionary
     Distributed Data Base and Transactions Processing Design Language
     Module Interface Checker
     Module Cross-Reference Analyzer
     Automated Simulator Builder
     Automated Archiver for Design Specifications

• Implementation Tools
     Distributed Applications Programming Languages
     Distributed Systems Implementation Languages
     Editors
     Text Managers (source and object code file systems)
     Source Code Manager
     Program Cross-Reference Analyzer
     Language Processors
     Compiler Development Tools

• Quality Assurance Tools
     Flow Charter
     Test Harnesses
     Test Coverage Analyzer
     Test Data Generator
     Control Flow Analyzer
     Data Flow Analyzer

• Maintenance Tools
     Source Code Debugging
     Trouble Report and Comment Tracking System

• Cross-Environment Tools
     Cross Compilers
     Environment Simulators

• Miscellaneous Utility Tools
     Program Archiver

• Management Tools
     Project Status Control
     Project Status Report Generators
     Build Plan Recorders
     Configuration Manager
     Cost Estimator
     Version Manager

1.4.1.2  Applicability of Software Support Tools

The applicability of the various subcategories of tools to the different
phases in the overall life cycle is fairly obvious from the name of each
subcategory:

• Software Requirements/Specification Tools
     Problem Analysis and Functional Design

• Software Design Tools
     Logical System Design

• Software Implementation Tools
     Program Implementation

• Software Quality Assurance Tools
     Unit Test
     System Test
     System Maintenance

• Software Maintenance Tools
     System Maintenance

• Cross-Environment Tools
     Program Distribution and Installation

• Miscellaneous Software Utility Tools
     System Maintenance

• Software Management Tools
     System Integration
     Program Distribution and Installation
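
The mapping above can be held as a small table and queried in either
direction. The following Python sketch is illustrative only; the names
LIFE_CYCLE_PHASES, APPLICABILITY, tools_for_phase, and unsupported_phases
are hypothetical and not part of the study.

```python
# Life-cycle phases in chronological order, as listed in paragraph 1.3.
LIFE_CYCLE_PHASES = [
    "Problem Analysis and Functional Design",
    "Logical System Design",
    "Program Implementation",
    "Unit Test",
    "System Integration",
    "System Test",
    "Program Distribution and Installation",
    "System Operation and Utilization",
    "System Maintenance",
]

# Tool subcategory -> life-cycle phases it applies to (from the list above).
APPLICABILITY = {
    "Software Requirements/Specification Tools": [
        "Problem Analysis and Functional Design"],
    "Software Design Tools": ["Logical System Design"],
    "Software Implementation Tools": ["Program Implementation"],
    "Software Quality Assurance Tools": [
        "Unit Test", "System Test", "System Maintenance"],
    "Software Maintenance Tools": ["System Maintenance"],
    "Cross-Environment Tools": ["Program Distribution and Installation"],
    "Miscellaneous Software Utility Tools": ["System Maintenance"],
    "Software Management Tools": [
        "System Integration", "Program Distribution and Installation"],
}


def tools_for_phase(phase):
    """Return the tool subcategories applicable to one life-cycle phase."""
    return [tool for tool, phases in APPLICABILITY.items() if phase in phases]


def unsupported_phases():
    """Return the phases to which no tool subcategory is mapped."""
    covered = {p for phases in APPLICABILITY.values() for p in phases}
    return [p for p in LIFE_CYCLE_PHASES if p not in covered]
```

Under this listing the only phase without a software development tool
subcategory is System Operation and Utilization, which appears consistent
with the report's separate treatment of operational support capabilities.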

1.4.2  System Design Support Facilities

"System Design Support Facilities" describes that collection of hardware
and software facilities utilized to support the analysis, design, testing,
experimentation, and monitoring of distributed processing systems. System
design support facilities provide information about the distributed system;
they do not directly produce or modify the software defining the operation of
the system, nor do they directly support the operation of the system.

1.4.2.1  Examples of Hardware/Software Support Facilities

• Capacity Requirement Estimators
     Computation
     Storage
     Communication

• Simulators
     System
     Communications
     Transaction Processing

• Load Emulators

• Monitors/Performance Measurement
     Resource Utilization
     File Performance

• Testbeds
• Redundancy Requirement Planner
• Fault-Tolerance Estimator
• Database Designer's Workbench

1.4.2.2  Applicability of System Design Support Facilities

System Design Support Facilities are applicable primarily to the
analysis and design phases of the life cycle. The facilities are also useful
to support testing and maintenance as well as management of operations.

1.4.3  Operational Support Capabilities

"Operational support capabilities" directly support the operation of the
distributed processing system and are physically embedded in the operational
software system. These capabilities provide those functions which are
"unique" to distributed systems operations and are usually found in that
portion of the system software known as the "network operating system" (NOS)
or "distributed operating system" (DOS). (In the logical model of system
software as developed in the GIT FDPS Research Program, those functions
performing tasks similar to those found in centralized systems are included in
the "local operating system" (LOS). Although the LOS must interface with and
interact with the NOS/DOS, the GIT model places all support of distribution in
the NOS/DOS. It should be noted that current research at Georgia Tech
indicates that it may be possible to effectively combine all these operations
into a single global operating system.)

1.4.3.1  Examples of Operational Support Capabilities

• Access Control
• System Command Language
• Workload Distributor
• Resource Manager
• Task Graph Manager
• Interprocess Communication
• Scheduler
• Execution Manager
• File Manager
• Recovery Manager
• Communication Protocols

1.4.3.2  Applicability of Operating System Capabilities

Although  operating  system  capabilities  are  primarily  involved  with  the 
operation  of  the  distributed  system,  several  of  them  are  also  applicable  to 
other  phases  of  the  life  cycle  as  shown  in  the  chart  below  in  paragraph  1.5. 

1.5  APPLICABILITY OF SYSTEM SUPPORT CAPABILITIES

The  matrix  shown  in  Chart  1  indicates  the  primary  applicability  of  the 
various  support  capabilities  identified  in  this  study. 


CHART 1

UTILIZATION OF SYSTEM SUPPORT CAPABILITIES DURING
VARIOUS LIFE CYCLE ACTIVITIES

1.6  SUPPORT CAPABILITIES AND SYSTEM FUNCTIONALITY

There are a number of points of view or approaches that can be taken
when studying the characteristics, interrelationships, and relative
development priorities of support capabilities. The two analyses presented
thus far focused on the nature or form of the support capability and on the
life-cycle activity(ies) that each supports. Another very important point of
view is analyzing what support capabilities are required or implied by the
functionality to be provided by the operational system.


Distributed Resource Utilization
     Access Control
     Work Distribution and Resource Management
     Distributed Process Execution Manager
     Distributed Task Graph Manager
     Distributed File System
     Distributed Name Server
     Distributed Interprocess Communication

User Services
     Distributed Command Language
     Distributed Software Tools
     Distributed Compiling Shell
     Remote Access
     Mail

System Support
     Protocols for Interprocess and Interprocessor Communication
          Low-level
          Session
     Reliable Broadcast
     Network Simulators
     Communication Requirements Estimator

Fault-Tolerance Support
     Recovery/Rollback Manager
     Activity Journalling
     Redundancy Requirement Planner
     Fault-Tolerance Estimator

Distributed Data Base Operations
     Concurrency Control Mechanism
     Transaction Processing Manager
     Security Control Mechanism (Multi-Level)
     Application Design Tool
     Consistency Verifier
     Archive Support
     Transaction Processing Simulator


     Trouble/Change Report and Fix Tracking System
     Code Version Control
     Performance Measurement
     Quality Metrics

1.7  OTHER WORK IN THIS AREA

"Distributed systems" has become an extremely popular topic for
research. In fact, this topic has probably become the most popular research
area in computing today. The large amount of work being performed should make
it very easy to locate solutions to many, if not most, of the
problems/requirements described in this report. However, such is not the
case.

When one studies the work done under the general title of "distributed
systems", there is a grave danger of misunderstanding or misinterpreting the
exact nature of the target system involved. More specifically, it is usually
difficult, if not impossible, to determine the specific characteristics of the
system to which the results are applicable, especially with respect to the
degree of distribution that applies. In fact, in many instances it is obvious
that the researcher has not completely defined the specific class(es) of
systems to which his work applies. This task then falls on the reader, and
his analysis is then based on incomplete data and often erroneous assumptions.
The net result is that it is usually difficult, and often impossible,
to accurately determine the applicability of published results in this field.

For this reason we will not attempt to survey, much less catalog, other
work done in this area on the topic of support capabilities for distributed
systems. However, there are two activities that should be mentioned: the
distributed computing research program of the U.S. Army Ballistic Missile
Defense Advanced Technology Center (BMDATC), Huntsville, Alabama, and the
"Distributed Processing Tools Definition Study" performed by the Data Systems
Division of General Dynamics for the U.S. Air Force, Rome Air Development
Center (RADC/ISIE), Contract F30602-81-C-0142.

1.7.1  BMDATC-P 

This project has been underway for several years, with the definitional
phase starting in 1975. There have been a large number of contracts in this
program covering, to varying degrees, almost all of the areas identified in
this report. The concept of an "Integrated Tool Set" is proposed to support
the steps "requirements design" through "software design" and
"implementation." This program has also developed and installed a
multi-computer testbed at Huntsville.

The types of target systems considered originally were not quite as
loosely-coupled as those addressed by Georgia Tech; however, the focus has
changed some over the years. The point of contact is BMDATC-P.

1.7.2  General Dynamics (RADC) Project

This was a three-phase project:

  I  - Study of hardware/software technologies for embedded distributed
       processing systems (EDPS); supporting the identification of
       techniques, requirements, and impacts for EDPS life-cycle phases.

 II  - Survey of existing EDPS tool inventories; developing the EDPS life
       cycle requirements with no near-term tools.

III  - Analysis of EDPS problem areas; resulting in a prioritized list of
       candidate technologies for R&D and estimates of the effort involved
       in each as well as its potential benefit.

A variety of different types or classes of distributed systems were
considered; however, the level of detail in the discussion often does not
permit their exact definition. Heavy emphasis was given to object-oriented
models.




1.8  ORGANIZATION OF THIS REPORT

The goal of this project was to identify, classify, and prioritize
system support capabilities (SSC) as they relate to FDPS/loosely-coupled
systems. The description and discussion of individual support capabilities is
organized based on the nature or form of the SSC involved. The three sections
of the report containing these descriptions are:

Section  2:  Software  Development  Support  Tools 

Section  3:  Distributed  System  Design  Support  Facilities 

Section  4:  Operational  Support  Capabilities 

These  sections  contain  discussion  of  specific  support  capabilities  or 
tools  which  we  believe  should  be  implemented  or  specific  areas  which  should  be 
researched  to  provide  the  basis  for  a  later  implementation.  Each  discussion 
contains  the  following  subsections: 

1.  Short description of the support capability/tool

2.  Background (why this support capability is required)

3.  Problems (general and specific) to be solved

4.  Proposed solutions (or initial approaches for research)

5.  Relationship to other FDPS work and SSC's

6.  Resources and schedule

Because of the very nature of this document, certain SSC's are more
completely defined than others. In most cases the need or the desire for a
specific support capability is recognized before the exact means to provide it
is determined. Thus, the level of each discussion varies with our current
understanding of the problem and schemes to implement solutions. Additional
work past the definition stage has been performed on several of the support
capabilities. This work has been identified in the discussions in Sections
2, 3, and 4, and there are several references to detailed material included in
the Appendix to this report, which is published as a separate volume.

Section  5  concludes  the  report  addressing  the  priorities  for  the 
development  of  specific  SSC's  or  groups  of  them. 

SECTION 2
SOFTWARE DEVELOPMENT SUPPORT TOOLS

2.1  DESIGN LANGUAGES

2.1.1  Introduction

Among the potential uses of a fully distributed processing system is the
execution of distributed programs. Such programs consist of a collection of
relatively loosely-coupled modules which together perform a single logical
task. Such programs are termed "distributed" programs because the individual
modules may (or in some cases, must) be executed in parallel on separate nodes
of the FDPS. The possible motivations for such parallel executions include,
among others: (1) operating on large quantities of data at the nodes where
they are stored or (2) taking advantage of inherent parallelism in a task
being implemented. As an example of the latter of these two motivations, a
distributed compiler has recently been implemented as part of the FDPS
research project, in order to study the advantages of organizing a compiler
using a pipeline structure and thus executing the normal phases of a
compilation in parallel [MILL82].

The development of software for distributed systems may require some new
techniques throughout the software life cycle. One problem which must be
addressed is how the various parts of a distributed program can be described
as a single, conceptually unified program. At the programming language level,
few existing or proposed languages which include features for expressing
parallelism or concurrency provide much help. Information about how the parts
of programs written in these languages interact can only be obtained by
detailed examination of the code for the individual parts. One project
currently in progress within our research program is concerned with the
development of a set of language features (called PRONET) which allow the
description of the interactions among a group of "processes" by way of a
"network" specification. These specifications can be interpreted as abstract
descriptions of the communication behavior of the processes which make up
distributed programs and thus provide at least some of the conceptual
unification we desire. We believe that it may be useful to apply the concepts
and perhaps even the notation of PRONET network specifications as the basis of
a design language for distributed software.
2.1.2 Background

The  need  for  a  design  phase,  prior  to  the  implementation  phase,  has  long 
been recognized ([PARN72], [LISK72], and [WIRT71]). Benefits which accrue
from such a phase include reliability of software, productivity of programmers
and  maintainability  of  software.  The  design  phase  consists  basically  of 
determining  what  services  are  required  of  the  software  and  then  deciding  how 
the  software  is  to  provide  these  services  [PETE81]. 

The  function  of  a  design  language  is  two-fold  [JENS78].  First,  a  design 
language  should  allow  the  software  designer  to  specify  his  ideas  in  a  form 
that will be understood by other people. This is especially important in
cases  where  more  than  one  person  is  working  on  a  project.  The  design  language 
becomes  the  medium  in  which  the  designers  communicate.  Second,  the  design 
language  should  be  machine  processable.  Programs  can  then  be  written  which 
analyze  the  design.  In  this  way,  inconsistencies  in  the  design  can  be  caught 
by  the  design  tools  and  brought  to  the  attention  of  the  designers. 

Much  work  has  been  done  in  developing  methodologies  and  notations  for 
representing  program  design.  A  few  include  PDL  [CAIN75],  HOS  [HAMI76],  SARAH 
[ESTR78], TOPD [SNOW78], and FLEX [SUTT81].

2.1.3 Problem

The  model  of  computation  proposed  for  a  FDPS  is  that  of  a  co-operating 
network of processes [LEBL81]. These processes are independent in the sense
that no process controls the behavior of another directly. Instead, the
processes communicate by passing messages to one another. These messages
allow  the  programs  to  exchange  data  and  to  coordinate  their  behavior. 
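
As a minimal sketch of this model (it is not PRONET code), the following Python fragment uses two threads and a queue to stand in for two processes on separate nodes and the message channel between them. The names producer, consumer, and run_demo, and the ("done", None) end marker, are assumptions made for the illustration.

```python
import queue
import threading


def producer(outbox, items):
    """Send each item as a message, then a marker that signals completion."""
    for item in items:
        outbox.put(("data", item))
    outbox.put(("done", None))


def consumer(inbox, results):
    """Receive messages until the end marker arrives.

    The consumer never inspects the producer's variables; everything it
    learns about the producer arrives as a message.
    """
    while True:
        kind, payload = inbox.get()
        if kind == "done":
            break
        results.append(payload * 2)


def run_demo(items):
    channel = queue.Queue()  # the only link between the two processes
    results = []
    workers = [
        threading.Thread(target=producer, args=(channel, items)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Neither function directs the other; the end marker is the only coordination, and it travels over the same channel as the data.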

This  model  allows  the  designer  of  distributed  software  to  take  advantage 
of  any  parallelism  available  in  the  problem  he  is  attempting  to  solve.  Also, 
the goal of breaking the problem into smaller, relatively independent subunits
fits  in  well  with  traditional  concepts  of  software  design,  for  example, 
information  hiding  [PARN72].  However,  the  model  also  provides  new  challenges 
for  the  software  designer. 

One  is  that  the  designer  must  explicitly  design  the  process  structure  to 
take  advantage  of  any  parallelism  in  the  problem.  He  must  design  so  that 
independent actions are performed by separate modules. The designer can no
longer think in terms of shared memory. Instead, the components of his design
exchange data by message-passing. The designer is also responsible for
synchronizing the efforts of the various processes, at least at a conceptual level.

Clearly, the designer of distributed software could benefit from a
design language which would allow him to easily specify the above.
Unfortunately, most existing design languages and methodologies do not meet
the above goals. For the most part, they assume a hierarchy of program
control.  That  is,  there  is  some  single  component  in  the  design  which  is 
responsible  for  coordination  of  all  the  others.  The  tools  that  they  provide  for 
specifying  the  synchronization  and  communication  of  the  components  in  the 
design are not adequate. Another concern of distributed software is
reliability: what happens to the program as a whole if there is a failure at
one  of  the  machines  in  the  system.  The  design  language  should  allow  the 
designer  to  take  this  into  account. 

2.1.4  Proposed  Solutions 

Efforts made for support of distributed software have already appeared
in the form of implementation languages ([BRIN78] and [HOAR78]). Here at
Georgia Tech, implementation of a compiler for PRONET is almost complete.
(PRONET is described in Appendix C.) The language consists of two parts: a
process specification notation and a network specification notation. The
subunits would be implemented using the former notation; the interactions
between the subunits would be described using the latter notation. Effort in
the direction of design languages, however, has been lacking.

Recent  work  on  the  use  of  Ada  as  a  design  language  is  also  relevant  to 
this problem. (See [BOOC81] and various reports in Ada Letters.) The package
and  tasking  facilities  in  Ada  provide  the  designer  with  the  ability  to  express 
a  program  as  abstract  objects.  Thus,  he  can  hide  information  and  specify  the 
cooperative  aspect  of  the  subunits. 

Other work done specifically for distributed software includes that
described in [YAU81]. Here, the data and functional specifications are
considered separately. The program is broken up into components which
interact only through shared resources or messages. However, the methodology
used in this approach requires that interactions across processor boundaries
be identified. This aspect of the work does not go along with the general
philosophy of an FDPS. That is, the user of an FDPS should be able to view the
system as one unified system, and the system itself should normally be
responsible for where work occurs.

We propose to add the concept of an abstract description of the
communication behavior of parallel processes to existing design language
concepts. The existing design language would be used to specify the
compositions of the subunits. Network specifications like those of PRONET would
indicate how these subunits interact. In addition, since there is a compiler
underway  for  PRONET,  a  processor  for  the  design  language  could  utilize  much  of 
the  work  done  there. 

2.1.5 Relationship to Other Work

The  FDPS  project  consists  of  many  interrelated  projects.  Work  in  this 
area  includes  the  implementation  of  a  network  operating  system,  a  distributed 
system  test-bed,  a  distributed  execution  monitor,  and  a  distributed  debugger. 
These  projects  represent  major  efforts  in  software  engineering.  The 
assistance  of  a  design  language  in  developing  the  programs  would  be  invaluable 
in producing reliable, comprehensible systems. Work on a design language and
a methodology for its use can proceed independently of these other projects.

2.1.6 Resources and Schedule

The major tasks required for this project are to survey these existing
design languages, to choose one which is most compatible with our concept of
independent processes communicating by message passing, and to integrate the
PRONET specification concepts with that design language. This integration
will include adding some consideration of reliability in the presence of node
failures.  Such  a  concept  again  goes  beyond  those  found  in  typical  design 
languages. 


To cover a 12 month period:

Manpower                    man-months

Senior Staff                2
(2 m-m/year)

Junior Staff                6
(6 m-m/year)

Programmer                  0
(0 m-m/year)

Secretarial Support         1
(1 m-m/year)

Equipment

Computer Time               Moderate usage
                            (for designing syntax of
                            new design language)

Timing

First period of 3 months:

Examination of existing design languages

Last period of 9 months:

Extension of a design language to include
new concepts described above
2.2 LANGUAGE SUPPORT FOR amSL DISTRIBUTED
2.2.1 Why a Language for Distributed Applications?

Initially, high-level language work on distributed systems involved the
development of "clever" compilers for traditional languages. These compilers
partitioned the code produced for distributed execution in a way which was
more or less transparent to the user. However, we feel that, since
distributed systems involve new models of computation, it is appropriate to
design new languages which provide primitives more suited to these new models.
Examples of such new language features may be found in CSP ([Hoar78]), DP
([Brin78]), PLITS ([Feld79]), and ARGUS ([Lisk79] and [Lisk82]).


New features aiding the design and description of distributed programs
are central to the design of PRONET ([Macc82]), a language currently being
implemented at Georgia Tech. The new capabilities developed for this language
are being added to Pascal as a base language, but since they involve only
interprocess communication and interconnection of processes via message
channels, they could be added to many other languages.

Among the important features of PRONET are the abstraction capabilities
which it provides for the specification of networks. Network specification
and process description are separated in PRONET by the division of the
language facilities into two sublanguages: NETSLA (Network Specification
Language) and ALSTEN (an extended Pascal for process description). These
capabilities enable the interactions between processes to be encapsulated,
aiding in the understanding of complex programs and providing information to
the distributed operating system needed for making placement and scheduling
decisions. A further description of these aspects of PRONET can be found in
Appendix  C. 
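
The separation of network specification from process description can be illustrated with a small Python sketch. It does not use NETSLA or ALSTEN syntax, which are described only in Appendix C; the port names, the CONNECTIONS table, the END marker, and the helper functions are all assumptions made for the illustration. Each process body sees only its own named ports, while the connection table alone states which process talks to which.

```python
import queue
import threading

END = None  # end-of-stream marker; an assumption of this sketch


# Process descriptions: each body knows its ports, never its peers.
def source(ports):
    for n in (1, 2, 3):
        ports["out"].put(n)
    ports["out"].put(END)


def doubler(ports):
    while (n := ports["in"].get()) is not END:
        ports["out"].put(2 * n)
    ports["out"].put(END)


def make_collector(results):
    def collector(ports):
        while (n := ports["in"].get()) is not END:
            results.append(n)
    return collector


# Network specification: (sender, output port, receiver, input port).
CONNECTIONS = [
    ("src", "out", "dbl", "in"),
    ("dbl", "out", "sink", "in"),
]


def peers(connections):
    """Derive who talks to whom from the specification alone."""
    return sorted((sender, receiver) for sender, _, receiver, _ in connections)


def run_network(processes, connections):
    """Create one channel per connection, then run every process body."""
    ports = {name: {} for name in processes}
    for sender, out_port, receiver, in_port in connections:
        channel = queue.Queue()
        ports[sender][out_port] = channel
        ports[receiver][in_port] = channel
    threads = [threading.Thread(target=body, args=(ports[name],))
               for name, body in processes.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_demo():
    results = []
    processes = {"src": source, "dbl": doubler, "sink": make_collector(results)}
    run_network(processes, CONNECTIONS)
    return results
```

Because peers() reads only the connection table, a tool (or an operating system making placement decisions) could obtain the communication structure without examining the code of any process.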

PRONET  also  provides  features  which  take  advantage  of  the  capability  of 
distributed systems for graceful degradation. These features allow recovery
from mechanical failures in a network. Appendix D provides an overview of
these  features. 

2.2.2  JBbbULNHL  ^  ^llN.  JZMlgR  ilC  WWRT 

During  the  course  of  work  on  PRONET,  areas  for  further  study  have  been 
identified. In particular, our experience has shown that PRONET lacks
facilities for providing the robustness in the face of algorithmic failures
(due to flaws in software design), which is a desired property of distributed
programs for some applications. This problem, and proposed approaches for its
solution,  are  examined  in  detail  below. 

Some other questions generated by the design and implementation of
PRONET which have been identified are:

• The process description language ALSTEN, being based on Pascal,
  is necessarily simple. Better abstraction facilities, such as
  those provided by Ada, may be useful;

• Process execution is currently straightline sequential. The
  "actors" model of process execution may be more appropriate to
  the asynchronous nature of the interprocess communication model
  employed in PRONET;

• The NETSLA sublanguage needs more information about physical
  network attributes. Information about, say, the node at which a
  user is most often located, or which nodes are nearest to the
  user, would be useful in scheduling;

• The interprocess connections provided by NETSLA are currently
  unintelligent. The provision of pipe transforms and consistency
  checks in these interconnections should be attempted;

• Problems have been encountered in interfacing the PRONET
  implementation with the local operating system. In particular,
  the distinction between the command language of the OS and the
  NETSLA sublanguage (which is a command language of sorts) is
  hazy;

• The run-time support routines of PRONET at present subsume most
  of the functionality required of a distributed OS. The use of
  operating system primitives appropriate to the implementation of
  the new facilities provided by PRONET would be helpful. The
  implementation of the language on a distributed system running
  under such an operating system should be easier when attempted.

2.2.3 The Problem of Failure

Failures in distributed systems are of two varieties: mechanical
(failures in system hardware) and algorithmic (failures due to software errors
or design flaws). Schemes for dealing with failures have recently been
surveyed by Kohler ([Kohl81]). As has been mentioned above, PRONET provides
extensive features for aiding recovery from mechanical failures. However, the
problem of algorithmic failure has yet to be addressed in PRONET.

Methods  for  treating  algorithmic  failure  have  been  surveyed  by  Randell 
([Rand79]). He divides these schemes into the so-called forward and backward
automatic recovery schemes. In forward-recovery methods, predictions about
the location and consequences of software errors are necessary, and thus these
methods are not suitable for treating errors caused by design faults. The
exception-handling methods used in languages such as PL/I and Ada are
forward-recovery methods, and are used mainly for anticipated errors or conditions
such as faulty data or overflow.

Backward-recovery  schemes,  on  the  other  hand,  assume  no  previous 
knowledge  of  the  location  or  nature  of  faults.  Rather,  backward  recovery  is 
analogous  to  mechanical  backups  in  hardware  systems.  Information  about  the 
system state previous to the fault is restored from a checkpoint, and a
back-up process is started. The back-up process is necessarily not the same as the
failed process, since an identical copy would only fail again. In general, the
back-up process (or processes) is simpler than the original process, and may
provide only a primitive simulation of the functions of the original process
(such as forwarding messages) in order to keep a network going.


The  recovery-block  scheme  described  by  researchers  at  the  University  of 
Newcastle-upon-Tyne ([Shriv79], [Shriv81]) is an example of a
backward-recovery method. The syntax for describing a sequence of recovery blocks is:


    assure <acceptance test> by
        <original block>
    else by
        <back-up block 1>
    else by
        ...
    else error;


where  some  of  the  "back-up  blocks"  may  be  simple  retries  of  previous  blocks. 

If  a  failure  occurs  in  the  original  block,  back-up  blocks  are  tried  until  one 
completes  without  failure  and  the  acceptance  test  is  satisfied,  or  else  an  error 
is  signalled.  The  back-up  blocks  may  have  to  undo  permanent  effects  made  by 
their  predecessors  before  doing  their  own  work. 
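
The control flow of a recovery block can be sketched in Python as follows. This is an illustration only, not the Newcastle notation or a proposed PRONET mechanism; the function recovery_block, the RecoveryError exception, and the sorting example are hypothetical. The sketch sidesteps the problem of undoing permanent effects by giving each alternate a private copy of the checkpointed state, an assumption that would not hold for effects outside that state (such as messages already sent).

```python
import copy


class RecoveryError(Exception):
    """Raised when every alternate fails or is rejected by the acceptance test."""


def recovery_block(state, acceptance_test, alternates):
    """Try each alternate on a fresh copy of the checkpointed state."""
    checkpoint = copy.deepcopy(state)           # taken on entry to the block
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)   # backward recovery: restart from checkpoint
        try:
            alternate(candidate)
        except Exception:
            continue                            # a crash counts as a failed alternate
        if acceptance_test(candidate):
            return candidate                    # commit: the checkpoint can be discarded
    raise RecoveryError("all alternates failed")


# Hypothetical alternates for a sorting task.
def faulty_sort(state):
    state["items"].reverse()                    # deliberate design fault


def crashing_sort(state):
    raise ZeroDivisionError("simulated fault")


def simple_sort(state):
    state["items"] = sorted(state["items"])


def is_sorted(state):
    items = state["items"]
    return all(a <= b for a, b in zip(items, items[1:]))
```

For example, recovery_block({"items": [3, 1, 2]}, is_sorted, [faulty_sort, crashing_sort, simple_sort]) rejects the first alternate on the acceptance test, skips the second after its exception, and returns the result of the third.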


Problems  in  the  implementation  of  recovery  blocks  include  the  selection 
of checkpoint intervals and of appropriate points at which previously
checkpointed information may be discarded ([Russ80]). The discarding of checkpoint
information  is  equivalent  to  "commitment"  to  the  results  of  the  checkpointed 
block. 


In another recent paper from Newcastle-upon-Tyne ([Cris82]), Cristian
notes that a mixed strategy of on-units and recovery blocks is necessary to
obtain highly reliable software.

A possible strategy which should be attempted in adding algorithmic-failure
recovery mechanisms to PRONET is the notion of "overlaying" a back-up
process  on  the  address  space  of  its  failed  predecessor  ([refs??]).  This 
scheme  would  have  the  additional  advantage  of  allowing  transparent  replacement 
of  existing  permanent  network  processes.  Old  software  could  be  replaced  at  an 
appropriate  time  (say,  at  a  checkpoint)  by  overlaying  a  new  version  on  the 
address  space  of  the  old  software,  without  having  to  halt  the  entire  program. 

Considerable  further  study  of  the  reliability  issue  is  required.  It  is 
not  clear  at  present  whether  extremely  complex  programming  conventions  will  be 
necessary in the framework of PRONET to achieve reliability. Thus, it may be
more appropriate to design completely new programming languages if such
complexity is not considered acceptable.

2.2.4 Proposed Solution

Appendix  F  provides  a  more  complete  survey  of  software  fault  tolerance 
techniques  and  some  proposed  research  directions. 

2.2.5 Relationship to Other FDPS Work

Other  major  projects  in  the  Georgia  Tech  FDPS  project  include  the 
development  of  an  operating  system  for  an  FDPS.  The  availability  of  a 
language which supports the clear expression of interprocess communication
relationships and provides information for the partitioning of programs to
execute in a distributed manner should prove quite useful in the
implementation and operation of such an operating system. Also, it is clear that
language facilities supporting transparent error recovery should be useful in
the implementation of software which requires high reliability, such as
operating system components, as well as in user programs.

2.2.6 Resources and Schedule

To cover an 18 month period:

Manpower                    man-months

Senior Staff                4.5
(3 m-m/year)

Junior Staff                9
(6 m-m/year)

Programmers                 9
(6 m-m/year)

Secretarial Support         1.5
(1 m-m/year)

Equipment

Computer Time               Substantial

Timing

First period of 9 months:

Transport and complete current PRONET
implementations; design new language features.

Last period of 9 months:

Implement and evaluate new features.
2.3 COMPILER DEVELOPMENT TOOLS

If  distributed  systems  constructed  of  heterogeneous  computers  are  to  be 
useful as unified systems (rather than just as communication networks like
ARPANET), the compilers available to programmers must be far more compatible
with one another than those currently supplied by vendors. For example,
compilers for the same language on different machines must accept exactly the
same features and implement them consistently. Further, the user interfaces
to these compilers should be consistent. These requirements imply that the
development of a heterogeneous distributed system in which the differences
among the machines are largely invisible to users will require the development
of families of new compilers which are not machine-dependent any more than
necessary. 

We  use  the  term  "families"  of  compilers  because  of  the  techniques  we 
envision  for  their  construction.  A  family  of  compilers  for  a  particular 
language  will  be  said  to  exist  on  a  distributed  system  when  the  compilers  for 
that language on the various machines all use a common machine-independent
front-end. (A front-end is that portion of the compiler which inputs the
source program and translates it to some intermediate form (IMF).) This
sharing of a front-end leads us to think of the group of compilers as a family.

The development of a compiler is usually considered to be a complex and
expensive task. It will be necessary to build a powerful collection of
compiler generation tools in order to make our idea of developing new compilers
for a distributed system practical. The following sections describe some
areas in which work might be done to facilitate the creation of such tools.
The  net  result  of  this  work  should  be  a  tool  set  which  provides  the  maximum 
possible  support  for  the  generation  of  compilers  usable  on  a  distributed 
system. 

Most  of  the  work  on  this  project  will  be  software  development  rather 
than research. It should be possible to draw on many related but less
comprehensive development efforts. Some of these are described in [Lanc76],
[Cole74], [Gyll  ] and [Basi75]. The comprehensive compiler development system
described recently in [Rudm82] even goes beyond many of the needs we
anticipated, since it is oriented toward the development of extremely large
programs.
2.3.1 Front-end Generation

We have already developed programs which generate table-driven parsers
and scanners from formal syntactic specifications. Using these tools, it is
quite straightforward to implement a machine-independent front-end of a
compiler for any programming language. A single such front-end could be used as
part of any number of compilers for a particular language running on different
machines, thus taking an important first step toward solving the compatibility
problems  discussed  above. 

There are two tasks which would improve the utility of these tools:

Task 1: The parser and scanner generators should be combined into a single
tool, along with some mechanism to support the automation of IMF generation
as a result of parser actions.

Task 2: Practical front-ends require the inclusion of some mechanism to
handle syntax errors discovered by the parser. A number of such mechanisms
have been proposed and we have implemented one. An evaluation of the
practical factors (such as time and space requirements) of the various
alternatives should prove useful in order to choose the appropriate technique
for inclusion in our compiler generation system.

2.3.2 Automated Code Generator

In  addition  to  the  families  of  compilers  previously  described,  another 
kind of family may also exist on a distributed system. If compilers for more
than one language share the same IMF, then a single code generator can serve
to finish the compilation job (translating IMF to machine code) on each kind
of  machine. 
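
The two kinds of family can be illustrated with a toy Python sketch in which two front-ends and two code generators meet at a shared intermediate form. Everything here is hypothetical: the two source notations (postfix and prefix arithmetic), the tuple-based IMF, and the two target instruction sets are invented for the illustration and are not the IMF or the machines used in our tools.

```python
# Shared intermediate form (IMF): a list of ("push", n) and ("op", symbol) tuples.

def front_end_postfix(src):
    """Front-end for a postfix notation, e.g. '2 3 + 4 *'."""
    return [("op", tok) if tok in ("+", "*") else ("push", int(tok))
            for tok in src.split()]


def front_end_prefix(src):
    """Front-end for a prefix notation, e.g. '* + 2 3 4'."""
    tokens = src.split()

    def parse(pos):
        tok = tokens[pos]
        if tok in ("+", "*"):
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return left + right + [("op", tok)], pos
        return [("push", int(tok))], pos + 1

    imf, _ = parse(0)
    return imf


def back_end_stack(imf):
    """Code generator for an imaginary stack machine."""
    names = {"+": "ADD", "*": "MUL"}
    return [f"PUSH {arg}" if kind == "push" else names[arg] for kind, arg in imf]


def back_end_register(imf):
    """Code generator for an imaginary register machine."""
    code, depth = [], 0
    for kind, arg in imf:
        if kind == "push":
            code.append(f"LOAD R{depth}, #{arg}")
            depth += 1
        else:
            depth -= 1
            mnemonic = "ADD" if arg == "+" else "MUL"
            code.append(f"{mnemonic} R{depth - 1}, R{depth}")
    return code


def compile_with(front_end, back_end, src):
    """Any front-end may be paired with any back-end through the IMF."""
    return back_end(front_end(src))
```

With two front-ends and two code generators, four compilers are obtained from four components; in general m languages and n machines need m + n components rather than m x n complete compilers, provided the IMF is free of language and machine dependencies.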


Considerable work has been done in recent years concerning the
automation of code generation. While none of the results are commonly used in
production compilers, considerable progress has been made. We would like to
evaluate these efforts in order to see if any can practically minimize the
work needed to generate the families of compilers we envision. This
evaluation will include the implementation of at least one such system.


2.3.3  A.  OaiiM.atnt» 

Lacking any such automated code generation capability, we have
constructed one code generation tool to facilitate compiler implementation on our
PRIME computers. A general code generator which accepts a symbolic,
tree-structured intermediate language is currently in use for the implementation of
several compilers. This program was designed to generate high-quality code by
means  of  a  complex  case  analysis.  While  no  automated  techniques  went  into  the 
implementation  of  this  code  generator,  its  availability  in  a  form  which  can  be 
used  by  any  compiler  writer  essentially  eliminates  dealing  with  machine- 
specific  details  from  our  compiler  efforts.  Similar  generators  for  other 
machines  would  considerably  simplify  the  envisioned  compiler  development 
efforts. 

An  attempt  should  be  made  to  retarget  this  code  generator  so  that  it 
generates code for the VAX. From this effort, we should learn about machine
dependencies in the IMF it uses and we will obtain a capability to construct
families  of  compilers  for  the  VAX  and  the  PRIME  which  share  a  front-end. 

2.3.4 Unification of Compiler Tools

The previous sections have discussed work involving a variety of
compiler tools. All of the tools are based on separate theoretical developments
and  thus  work  in  different  ways.  Compilation  and  all  of  its  intermediate 
phases  are  basically  translation  tasks,  that  is,  transforming  some  input 
language  to  a  different  output  language.  It  thus  seems  feasible  to  build  a 
higher-level  compiler  generation  tool  which  would  allow  similar  specification 
techniques  to  be  used  to  automate  the  construction  of  as  many  of  the  phases  of 
a compiler as possible.

2.3.5 Relationship to Other FDPS Work

No  other  FDPS  development  projects  are  dependent  on  this  work.  Rather, 
it  is  targeted  toward  enhancing  the  usability  of  an  FDPS.  The  development  of 
a  unified  set  of  tools  would  greatly  simplify  the  construction  of  the  required 
families  of  compilers. 


2.3.6 Resources and Schedule

To cover a 12 month period:

Manpower                    man-months

Senior Staff                3
(3 m-m/year)

Junior Staff                3
(3 m-m/year)

Programmers                 12
(2 at 6 m-m/year)

Secretarial Support         1
(1 m-m/year)

Equipment

Computer Time               Moderate

Timing

First period of 6 months:

Existing tools should be adapted to
run on VAX and Prime.

Last period of 6 months:

Development of unified tool concept;
implementation of automated code generators.
2.4 COMPILATION TECHNIQUES FOR DISTRIBUTED PROGRAMS

2.4.1  Introduction 

In the previous section, techniques for implementing families of
compilers were discussed. This section examines problems which result from the
fact  that  these  compilers  must  compile  distributed  programs.  That  is, 
programs  will  be  constructed  from  separate  parts,  which  will  undoubtedly  be 
compiled separately and may not ever all exist on a single machine. Thus
compilation techniques must be developed for the language features which allow
the  specification  of  such  programs.  These  new  techniques  may  well  involve 
interactions  with  linkers  and  with  the  operating  system  in  ways  quite 
different  from  what  is  common  on  current  systems. 

The  implementation  of  PRONET  will  provide  our  first  experience  with 
these problems. The prototype implementation, currently in progress, will not
deal  with  these  problems  in  their  full  generality  since  we  do  not  yet  have  a 
distributed  operating  system  that  can  provide  all  of  the  necessary  support. 
Once  such  an  operating  system  is  available,  some  effort  should  go  into  a  study 
and implementation of separate compilation in a distributed system.

2.4.2 Relationship to Other Work

Any FDPS project which will involve the development of a distributed
program  will  benefit  from  this  work. 

2.4.3 Resources and Schedule

To cover a 6 month period:

Manpower                    man-months

Senior Staff                1
(2 m-m/year)

Junior Staff                1
(2 m-m/year)

Programmers                 3
(6 m-m/year)

Secretarial Support         1
(1 m-m/year)

Equipment

Computer Time               Substantial
2.5 DISTRIBUTED COMPILERS

To  completely  utilize  a  fully  distributed  processing  system  (FDPS), 
distributed  versions  of  standard  systems  programs  should  be  developed  to  take 
advantage of any inherent parallelism. Such distributed programs have the
potential of greatly reduced response times. For the following reasons
compilers are an ideal case in point: (1) Compilers have a high degree of
inherent parallelism, (2) they are highly utilized system programs, and (3)
they  are  currently  slow  in  compiling  large  programs.  We  therefore  propose  to 
develop  practical  distributed  compilers  to  be  used  in  an  FDPS. 

2.5.1 Background

As  a  first  step  in  this  direction,  a  distributed  compiler  for  a  small 
language called Jigsaw was implemented (Jigsaw has if and while control
structures, integer, real, array, and record data structures, and parameterized
procedures). The purpose of this initial research was to test the feasibility
of distributed compilation and to get a handle on potential response time
improvements.

The  implementation  of  this  distributed  compiler  consisted  of  partitioning 
the compilation task into three subtasks: a lexical analyzer, a syntactic
analyzer, and a semantic analyzer which includes a code generator. Code
generation was included in the semantic analyzer because of the simplicity of
the target code. In practical distributed compilers the semantic analysis and
code generation would likely be split into separate subtasks. The three
subtasks were implemented as distributed processes that communicated with each
other by sending and receiving messages. The three processes along with their
communication links constitute a pipeline where the source code is
successively manipulated by each process to produce the target (object) code.
Diagrammatically the compiler looks like the following:


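source  +----------+  tokens  +-----------+ actions +----------+ target
------->| lexical  |--------->| syntactic |-------->| semantic |------->
        | analyzer |          | analyzer  |         | analyzer |
        +----------+          +-----------+         +----------+
             |                                           ^
             |               token strings               |
             +-------------------------------------------+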
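
The structure of this pipeline can be sketched in Python with three threads and three queues. This is a toy illustration, not the Jigsaw compiler: the source notation (postfix arithmetic), the token classes, the action names, the END marker, and the emitted instructions are all assumptions. What it preserves is the shape of the communication: the lexical analyzer sends token classes to the syntactic analyzer and token strings directly to the semantic analyzer, which combines them with the actions it receives.

```python
import queue
import threading

END = None  # end-of-stream marker; an assumption of this sketch


def lexical(source_lines, to_syntactic, to_semantic):
    """Send each token's class to the parser and its text to the semantic stage."""
    for line in source_lines:
        for text in line.split():
            to_syntactic.put("NUM" if text.isdigit() else "OP")
            to_semantic.put(text)
    to_syntactic.put(END)


def syntactic(from_lexical, to_semantic):
    """Turn token classes into actions (trivially, one action per token)."""
    while (kind := from_lexical.get()) is not END:
        to_semantic.put("push" if kind == "NUM" else "apply")
    to_semantic.put(END)


def semantic(actions, strings, target):
    """Pair each action with the matching token string and emit target code."""
    while (action := actions.get()) is not END:
        text = strings.get()
        target.append(f"PUSH {text}" if action == "push" else f"APPLY {text}")


def compile_pipeline(source_lines):
    tokens, actions, strings = queue.Queue(), queue.Queue(), queue.Queue()
    target = []
    stages = [
        threading.Thread(target=lexical, args=(source_lines, tokens, strings)),
        threading.Thread(target=syntactic, args=(tokens, actions)),
        threading.Thread(target=semantic, args=(actions, strings, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target
```

Because both streams arriving at the semantic stage are first-in first-out and carry exactly one entry per token, the stage can match each action with its token string without any further synchronization.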


In structuring the compiler this way, the three processes can execute
with a high degree of independence. For example, the lexical analyzer
iteratively reads lines of source code, generates the corresponding token
numbers and token strings, and sends them to the syntactic and semantic
analyzers respectively. At the same time the syntactic analyzer will, as soon
as it has token numbers in its receive queue, begin generating action numbers
to send to the semantic analyzer. The semantic analyzer will be similarly
executing at the same time. As long as the receive queues do not become empty
or full, the processes will not have to wait on each other. Under these ideal
conditions, if we assume communications delays are negligible and that the
three processes require an equal amount of processor time to perform their
analyses, the response time on an unloaded system would be 1/3 that of a
nondistributed compiler. Obviously, this is a substantial improvement, if it
can be realized.
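The 1/3 figure can be checked with a small timing model. The sketch below is
an idealization, not a measurement: it assumes a fixed processing time per
stage for each unit of work (for example, one source line), no communication
delay, and queues that never fill. The function names are illustrative.

```python
def sequential_time(units, stage_times):
    """Time for one processor to run every stage on every unit of work."""
    return units * sum(stage_times)


def pipelined_time(units, stage_times):
    """Time for an ideal pipeline: fill it once, then the slowest stage sets the pace."""
    return sum(stage_times) + (units - 1) * max(stage_times)


def response_ratio(units, stage_times):
    """Pipelined response time as a fraction of the sequential response time."""
    return pipelined_time(units, stage_times) / sequential_time(units, stage_times)
```

With three equal stages, `response_ratio(1000, [1, 1, 1])` is 1002/3000, or
about 0.334, and it approaches 1/3 as the number of units grows. In this model
the limiting ratio is the slowest stage's time divided by the sum of all stage
times, so unequal stages yield less than the ideal improvement.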

To test this distributed compiler, we ran it on the School of Information
and Computer Science's network of PRIME computers. The lexical and
syntactic analyzers were run on PRIME 550s while the semantic analyzer was
run on a PRIME 400. The compiler was timed on test programs varying in size
from 25 to 1200 lines of code and compared to the timing results of a
nondistributed single pass version that was constructed from the same basic
components. For an unloaded system we found the response time of the
distributed compiler to be 1/2 to 1/2.5 that of the nondistributed version.
For light to moderate loads the response time of the distributed version was
about 1/1.5 that of the nondistributed version. These results clearly
demonstrate the feasibility and desirability of distributed compilation. A
more complete description of this work can be found in Appendix G.

2.5.2 Problems and Proposed Solutions

The next step is to develop practical distributed compilers for full
scale languages such as Pascal. Such a distributed compiler would have the
same basic structure as the one we implemented and could be constructed using
the tools mentioned in section 2.3. However, three additional complications
must be considered. First, the semantic analysis needs to be separated from the
code generation using two processes instead of one. Second, the coordination
of useful error messages from each of the component processes necessitates
more interaction between the processes. Finally, in a real FDPS the execution
of a distributed compiler would be under the control of a distributed
operating system.

2.5.3 Relationship to Other FDPS Work

A practical distributed compiler will need to be interfaced with a
distributed operating system. The distributed operating system should
facilitate the initiation of a compilation with a simple command and should
schedule the component processes on the appropriate processors so as to
optimize some measure of system performance. If the distributed operating
system could find enough available processors to quickly schedule all of the
compiler's component processes, then response time improvements like those
observed in our implementation of the Jigsaw distributed compiler in the
unloaded case are quite possible in other situations. Thus, in its later
stages, this research will depend on the work being done on global operating
systems. In the meantime, such interaction is not necessary.

It is our hope that work with a distributed compiler will provide us
with insight into the general problem of distributed software. This project
should also include some effort toward generalizing the concepts used in
constructing the distributed compiler.

2.5.4 Resources and Schedule

To cover a 12 month period:

Manpower                               man-months

   Senior Staff (2 m-m/year)                2
   Junior Staff (3 m-m/year)                3
   Programmers (6 m-m/year)                 6
   Secretarial Support (2 m-m/year)         2

Equipment

   Computer Time                       Substantial

2.6 SOFTWARE VERSION MANAGEMENT

A distributed software system consists of modules developed, updated,
and maintained (perhaps simultaneously) by several people, possibly at
different nodes in the system. A system-wide version control system is
required in order to maintain consistency following modifications originating
at any node and to permit the recovery of previous versions.

2.6.1 Basic Version Control System

As a minimum, the version control system should maintain current and
previous versions of both source and object code. This provides the
capability to recover from unfortunate modifications by returning to the
unmodified version. A version control system is necessary if system
maintenance is permitted to occur locally.

The distributed version control system will be implemented as a
distributed data base. The extension of our basic software management system
to provide these minimum capabilities for our FDPS should be accomplished
without significant difficulty. The most important consideration involves the
case when modifications to the system are taking place simultaneously at
different nodes. If both changes involve the same module, there is little
threat to system consistency. One or the other modification will be
incorporated into the system. Difficulties arise, however, when the
modifications affect different modules, especially if the modifications are
proposed as different solutions to the same problem.

The minimum version control system can also be used to maintain other
types of information throughout the system life cycle, with additional
benefits. Besides source and object code, the version control system can also
maintain requirements specifications, design specifications, and documentation
on-line. The maintenance of specifications on-line increases visibility,
permits quick resolution of ambiguities, and eases the tasks of design,
implementation, and maintenance from remote locations. Little additional
extension is required to provide these capabilities, which are basically
clerical.

With somewhat more effort the basic version control system can be
expanded to function as an important tool both in development and maintenance.
A fully expanded version control system is essential if modern programming and
management practices are to be used at full potential. Use of the complete
system is described in the following section.

2.6.2 Version Control System and Development

The expanded version control system can be an essential part of the
development process from requirements specification through testing. The
first step in the development process is the preparation of a requirements
baseline. Prior to agreeing upon the final software system requirements, the
requirements specification should be shown to be

- compatible with the overall system design (which includes
hardware and desired operations, as well as software)

- complete (so no modifications are expected)

- consistent

- testable (note that testing considerations begin early in the
development)

- possible to be implemented on the selected hardware system (It
should be possible to estimate resource requirements at this
time. If the resources estimated exceed the resources
available, either the requirements can be modified to specify a
more modest system or the decision can be made to improve
hardware capability.)

Once the requirements for the project under development have been
reviewed and fully agreed upon by all interested parties, modification of the
requirements specification should only be performed through the change control
process, in the same manner that modification occurs during maintenance.

The presence of a baselined requirements specification as maintained by
the version control system permits the establishment and maintenance of
forward and reverse traceability from requirements through design to code and
from requirements to validation testing. Requirements tracing permits early
detection of errors during the design phase, ensures comprehensive testing,
and improves maintenance.


The next phase in system development is the establishment of a
preliminary design. This includes, as a minimum, top level software
structure, data base definition, interface definition, scheduling criteria, and
analysis of critical algorithms. The preliminary design, which is maintained
in a standard format by the version control system, should identify the
mapping from requirements, identify the groups or individuals responsible for
each software element, define modular structure, and specify required
processing resources, data base organization, and estimated resource budgets
(schedule, manpower requirements, and computer resources). Before detailed
design and implementation begin, the preliminary design baseline should be
demonstrated to be viable. The following conditions should be met:

- All requirements have been allocated.

- Storage, execution timing, and accuracy estimates have been
established for all modules. In later phases, monitoring of the
accuracy of these estimates can be used to identify potential
performance problems.

- The data base has been defined.

- Performance requirements can be met under operational loading
conditions.

These practices are intended to prevent the development of a system
which does not meet requirements. The early identification of performance
problems permits system redesign, or a reduction of performance or processing
requirements, to occur before coding is underway and modifications become more
expensive.

In these phases the version control system is used to increase
visibility by maintaining requirements specification, design documents, and
performance estimates on line, easily accessible to developers at different
locations. The version control system may perform checking to ensure that all
design elements are traceable to the requirements specification and that all
requirements are met. In addition, automated tools may be used to ensure that
desired conditions, such as consistency and completeness, are met by the
design.
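The traceability check described above might be sketched as follows. This is
only an illustration under assumed data structures: requirements are plain
identifiers, and each design element lists the requirements it claims to
satisfy. The function and identifier names are hypothetical.

```python
def check_traceability(requirements, design_elements):
    """Check forward and reverse traceability between requirements and design.

    requirements:    set of requirement identifiers in the baseline
    design_elements: dict mapping a design element name to the set of
                     requirement identifiers it claims to satisfy
    Returns (unallocated, untraceable, reverse_map).
    """
    allocated = set().union(*design_elements.values())
    # Forward check: every requirement must be allocated to some element.
    unallocated = sorted(requirements - allocated)
    # Reverse check: every element must trace to a baselined requirement.
    untraceable = sorted(
        name for name, reqs in design_elements.items()
        if not (reqs & requirements)
    )
    # Reverse map: which design elements implement each requirement.
    reverse_map = {
        req: sorted(name for name, reqs in design_elements.items() if req in reqs)
        for req in sorted(requirements)
    }
    return unallocated, untraceable, reverse_map
```

A non-empty `unallocated` list means the design does not yet cover the
baseline; a non-empty `untraceable` list points to design elements with no
justification in the requirements specification.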

Detailed design breaks down the preliminary design modules into routines
and specifies implementation features. In this phase as well, the version
control system may be used to ensure traceability.

When coding is underway, the version control system is used to maintain
and update source code and documentation. Enforcement of programming
standards is also performed to ensure uniformity and aid in maintenance. The
version control system can include automated tools to verify that programming
standards are met. Documentation standards should also be maintained.

Code maintenance is only one application of the version control system
during implementation. Management visibility is significantly increased. The
version control system permits easy determination of the status of each module
(design complete, coded, tested, etc.). The status of the entire system can
be obtained from this information and problem areas in the schedule
identified. Testing can also benefit. During requirements, design, and
implementation, test data for each routine should be identified. This data is
maintained by the version control system (and may be required by it). During
checkout, this data is used to test each routine and the entire system. Test
failures should be referred to the individual or group that originally
proposed the particular test. The version control system should maintain
testing status information and can be used to verify that all tests have been
passed.

2.6.3 Version System and Maintenance

The version control system maintains a program library which will
frequently contain several versions of each routine. The controlled or system
master portion of the library contains only those versions that form the
operational system. Routines in this portion of the library should meet all
project standards and be completely tested. Old, experimental, and test
versions are also available in the program library, but must be clearly
identified as such. During design and implementation, all software routines
should be represented in the program library, at first as a program stub, then
as untested code. During design, implementation, and unit test, modification
of a routine should be permitted only by the individual or group responsible
for its implementation. Following completion of unit test, routines in the
uncontrolled section of the library can be modified by anyone as long as both
old and new versions are maintained. Routines in the controlled portion of
the library can be modified only with the approval of the change control
board. The change control board must also approve deletion of uncontrolled
routines. It is important that the version control system implement these
restrictions by requiring appropriate authorization for any change.
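The library restrictions above can be expressed as a small authorization
check. The sketch below is one possible encoding, not a specification: the
record fields (`section`, `phase`, `owner`) and their values are assumptions
introduced for illustration.

```python
EARLY_PHASES = {"design", "implementation", "unit_test"}


def may_modify(routine, requester, board_approved=False):
    """Decide whether `requester` may modify `routine` in the program library.

    routine is a dict with the (hypothetical) keys:
      section: "controlled" or "uncontrolled"
      phase:   "design", "implementation", "unit_test", or "post_unit_test"
      owner:   individual or group responsible for the implementation
    """
    if routine["section"] == "controlled":
        # Controlled routines change only with change control board approval.
        return board_approved
    if routine["phase"] in EARLY_PHASES:
        # Through unit test, only the responsible party may modify.
        return requester == routine["owner"]
    # Uncontrolled and past unit test: anyone may modify, provided the
    # library keeps both the old and the new version.
    return True


def may_delete(routine, board_approved=False):
    """Deletion of a routine requires change control board approval."""
    return board_approved
```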

The version control system is also used to expedite the processing of
change requests. When a change request is received, the version control
system notes the source of the request and the date and records the request.
The change control board, which may be a single individual in charge of
maintenance if the system is small, or may consist of representatives from
system design, operation, management, and users in the case of a large system,
obtains change requests from the version control system. For each request,
the board must consider its necessity, its priority relative to other needs,
its possible side effects, and availability of funding. The board may decide
that a requested change should be implemented at once, scheduled for
implementation, deferred for further analysis, deferred without analysis, or
rejected. The action taken should be reported to the originator of the change
request by the version control system, which also maintains the status of all
requests until the requested change has been incorporated into the controlled
system (at which time the originator of the request should be notified) or
rejected by the board. The board uses the version control system to determine
possible side effects of changes by tracing back to the requirements
specification and forward to code. The version control system may notify the
change control board that a requested change has been repeatedly deferred or
that the same change has been requested by more than one originator.
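The change request handling just described can be sketched as a small record
keeper. The class names, the decision labels, the deferral threshold, and the
rule that two requests are "the same" when their descriptions match are all
assumptions made for the sake of a runnable illustration.

```python
from dataclasses import dataclass, field

DECISIONS = {"implement_now", "schedule", "defer_for_analysis", "defer", "reject"}


@dataclass
class ChangeRequest:
    originator: str
    date: str
    description: str
    status: str = "received"
    deferrals: int = 0


@dataclass
class ChangeRequestLog:
    requests: list = field(default_factory=list)
    notices: list = field(default_factory=list)  # (recipient, message) pairs

    def submit(self, originator, date, description):
        """Record a request with its source and date; flag duplicates to the board."""
        duplicate = any(r.description == description and r.originator != originator
                        for r in self.requests)
        self.requests.append(ChangeRequest(originator, date, description))
        if duplicate:
            self.notices.append(("board", f"duplicate request: {description}"))
        return len(self.requests) - 1

    def decide(self, request_id, decision, deferral_limit=2):
        """Record the board's decision and report it to the originator."""
        if decision not in DECISIONS:
            raise ValueError(f"unknown decision: {decision}")
        request = self.requests[request_id]
        request.status = decision
        self.notices.append((request.originator, f"{request.description}: {decision}"))
        if decision.startswith("defer"):
            request.deferrals += 1
            if request.deferrals >= deferral_limit:
                self.notices.append(
                    ("board", f"repeatedly deferred: {request.description}"))

    def complete(self, request_id):
        """Mark the change as incorporated and notify the originator."""
        request = self.requests[request_id]
        request.status = "completed"
        self.notices.append((request.originator, f"{request.description}: completed"))
```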


When a change is authorized, the board assigns responsibility for
development and testing to groups or individuals, sets the schedule, and
specifies the budget. The version control system permits the board to modify
status of authorized changes and to take additional action if required.
Status information may also be available to the originator of the change
request.


The version control system should permit any changes to be "undone" by
backtracking to a previous version. Maintenance of previous versions permits
this capability, but backtracking must be done carefully with consideration of
possible interactions between changes. Two requested changes can result in
four versions: the original, modification 1, modification 2, and both
modifications (1 and 2). If the changes occur successively, three versions
will be present in the system: the original, modification 1, and both
modifications. If it becomes necessary to undo modification 1 by backtracking
to the original routine, modification 2 is also undone. This should be
reported by the version control system, which must therefore maintain an audit
trail of all authorized changes.
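The interaction between successive changes and backtracking can be made
concrete with a short sketch. It models each version of a routine simply as
the set of modifications it contains; the class and method names are
hypothetical, and a real system would store the code itself.

```python
class RoutineHistory:
    """Version chain for one routine, with an audit trail of changes."""

    def __init__(self):
        self.versions = [frozenset()]  # version 0 is the original routine
        self.audit_trail = []

    def apply(self, modification):
        """Create a new version containing one more authorized modification."""
        self.versions.append(self.versions[-1] | {modification})
        self.audit_trail.append(("apply", modification))
        return len(self.versions) - 1

    def backtrack(self, version_index):
        """Make an earlier version current and report every modification undone."""
        undone = sorted(self.versions[-1] - self.versions[version_index])
        # Earlier versions are kept; the restored one becomes the newest entry.
        self.versions.append(self.versions[version_index])
        self.audit_trail.append(("backtrack", version_index, tuple(undone)))
        return undone
```

Applying modification 1 and then modification 2 leaves three versions in the
library. Backtracking to version 0 in order to undo modification 1 returns
both modifications, which is the report the version control system should
make.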


2.6.4 Relationship to Other FDPS Work

The considerations outlined in the preceding sections apply to the
version control system for any large software project, distributed or not. A
few additional considerations can be identified for the distributed version
control system.


A distributed software system must be able to operate in varied
environments. If the processors comprising the distributed system are
homogeneous, the environments may vary in loading and in available resources.
The software system must be able to meet performance and resource usage
requirements in each environment. The presence of processors that are not
identical increases environment variability. Use of a high level language may
permit one version of the system to operate at all nodes; however,
particularly for routines with high performance requirements or high machine
dependency, it may be desirable to maintain separate versions for the
different environments. If this is done, extra care must be taken to ensure
that all versions meet requirements and to ensure that all modifications are
carried out correctly.

Change requests may be local or global in nature. For instance,
rounding conventions on one system may cause errors that do not occur at other
nodes. A change to correct this problem will typically affect the system only
at the node at which it occurs. It may be considered desirable to permit such
changes to be authorized and carried out locally, or they may be authorized by
the global change board and implemented locally. In either case, local
modifications should be subject to the same traceability requirements and
standards as global modifications and should be available to the global
version control system. If this is not done, system behavior can become
unpredictable whenever locally modified routines are later modified globally.

2.6.5 Resources and Schedule

The extension of our basic software management system to provide the
capabilities of the basic version control system for our FDPS should be
accomplished without significant difficulty. Development of the expanded
distributed version control system will require significant effort and may be
undertaken as a series of modifications to the basic system. We believe that
the benefits resulting from the increased capabilities more than outweigh the
additional effort.

To cover a 12 month period:

Manpower                               man-months

   Senior Staff (3 m-m/year)                3
   Junior Staff (3 m-m/year)                3
   Programmers (12 m-m/year)               12
   Secretarial Support (2 m-m/year)         2

Equipment

   Computer Time                       Moderate

Timing

   First period of 6 months:
      Extension of basic version control system

   Last period of 6 months:
      Development of expanded version control system

2.7 COST ESTIMATION FOR DISTRIBUTED SYSTEMS

Cost estimation for distributed system development closely parallels
standard models of cost estimation. It is expected that the weightings of
various factors may require adjustment to represent distributed development
more closely. In particular, system integration can be expected to require a
significantly larger portion of development resources, particularly if
development is carried out simultaneously at different nodes. Manpower
estimates must be accurate at the modular level to permit development of
modules in different locations.

Correct, early estimation of resource requirements and the assignment of
available resources to various system components is important to any large
software development. It is particularly important in the development of
distributed software, as system loading may be affected. If resource
requirements can be estimated early, it is possible to identify possible
performance failures before coding takes place. This is discussed further in
the section on version control.

The first step in developing a cost estimation system for our FDPS is
identifying and obtaining a standard cost estimation system. The system
chosen should permit detailed estimation of resource and manpower requirements
at an early phase. The systems developed by Putnam and by Boehm are possible
candidates. Tuning of the standard system for correct estimation in a
distributed environment will be accomplished by maintaining careful resource
usage and manpower scheduling records for all software developments. The
model tuning process is not expected to require significant effort, but will
not be possible until a number of software development projects have been
completed.
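One simple way such tuning could proceed is sketched below. It is not the
Putnam or Boehm procedure: it assumes only that the standard system can be
called as a function from project size to estimated effort, and it derives a
single multiplicative correction from the records of completed projects. The
names and the choice of a geometric mean are assumptions for illustration.

```python
import math


def calibration_factor(records, baseline_model):
    """Geometric mean of actual/estimated effort over completed projects.

    records:        list of (size, actual_effort) pairs from project records
    baseline_model: function mapping size to the standard system's estimate
    """
    log_ratios = [math.log(actual / baseline_model(size))
                  for size, actual in records]
    return math.exp(sum(log_ratios) / len(log_ratios))


def tuned_estimate(size, baseline_model, factor):
    """Standard estimate scaled by the factor learned from past projects."""
    return factor * baseline_model(size)
```

The factor is meaningful only once several projects have been recorded, which
matches the observation that tuning must wait for completed developments.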

Lacking expertise in the cost estimation area and significant experience
in FDPS software development, we have only been able to identify this problem
as one requiring further study.


SECTION 3

DISTRIBUTED SYSTEM DESIGN SUPPORT FACILITIES

3.1 INTRODUCTION

As defined in Section 1 of this report, the primary purpose or function
of the System Design Support Facilities is providing information such as
performance, inherent reliability, etc. about the system, its design, or its
implementation. This is accomplished by the use of simulators, emulators,
monitors, estimators, and testbeds as well as combinations of all of these.
This class of support capabilities is implemented in software alone as well as
with combinations of special hardware and software.

Everyone has a list of favorite support capabilities in this class.
These lists seem to be heavily influenced by that individual's experience in
using one or the other. There has not been a great deal of study of this
area, so the list below is certainly influenced by our own experiences in
studying and implementing distributed systems. We have attempted to include
several that we have not had firsthand experience with, but that coverage is
probably quite incomplete.

The major problems in the development, implementation, and use of any
members of this class of support facilities are common to almost all of the
different capabilities:

- Accurately modelling/describing the system under examination.

- Validity and accuracy of the information obtained by either
direct or indirect measurement.

- Ability to obtain information without "distorting" the operation
of the system.

- Obtaining the information in a timely and efficient manner.

- Cost of the support facility.

3.2 PERFORMANCE MEASUREMENT

3.2.1 Purposes of Measurement

There are three general purposes of performance evaluation: selection
evaluation, performance projection, and performance monitoring [Lucas 71].
These are shown in Figure 3-1.


+------------+    +-------------+    +-------------+
| Selection  |    | Performance |    | Performance |
| Evaluation |    | Projection  |    | Monitoring  |
+------------+    +-------------+    +-------------+

Figure 3-1  Purposes of Performance Measurement

Selection evaluation involves the comparison of existing systems. The
most frequent application of selection evaluation techniques is for comparison
of computer systems to determine which system performs a given function most
efficiently or whether a given system configuration can support a particular
application. Selection evaluation is also applicable when measuring the
impact of different hardware or software on an existing system. For example,
selection evaluation is useful in determining whether the addition of a load
balancing algorithm improves interactive response time. Similarly, selection
evaluation can answer the question "Did the last change to the operating
system improve performance?" In all cases, the defining feature of a
selection evaluation is that the systems to be compared must exist and must be
available for testing.

Performance projection techniques are often applicable during the design
of new hardware and software systems. These techniques attempt to predict the
performance of new hardware and software designs prior to implementation.
They can also be used to predict the performance of a system under a new
workload or with a different hardware configuration. Performance projection
techniques can often be applied to the same problems as selection evaluation
techniques. However, the distinguishing feature is that it may not be practical
to actually test the systems under consideration: it may be too expensive to
test the actual configuration, the configuration may not be available, or the
system may not exist at all.

Performance monitoring techniques are applied in an attempt to
understand the behavior of existing systems, with the goals of improving
efficiency and service to users. Monitoring usually involves observing an
existing system under normal operating conditions. Quantities measured with
performance monitoring techniques are usually very dependent on the system
measured (e.g., number of page faults, number of times the dispatcher is
entered, etc.). For this reason, performance monitoring techniques are
usually applicable only for the comparison of similarly structured existing
systems. For instance, it is difficult to compare the performance of systems
that use different disk block sizes by comparing the number of physical disk
reads and writes.

In a distributed processing system testbed facility, performance
evaluation will be necessary for all three purposes. One need for a performance
measurement tool is in the area of selection evaluation. It is necessary to
test prototype systems and compare the results with the results predicted by
performance projection techniques, as well as with results obtained by testing
other systems. The tool must be able to empirically measure the performance
of existing software and hardware configurations, and must be able to provide
comparable measurements on similar configurations.

3.2.2 Techniques for Performance Measurement

A  number  of  different  performance  measurement  techniques  can  be  applied 
for  the  purposes  mentioned  above.  Figure  3-2  shows  these  techniques. 


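                Performance Measurement Techniques
                                |
             +------------------+------------------+
             |                                     |
       +------------+                       +-------------+
       |  Modeling  |                       | Measurement |
       | Techniques |                       | Techniques  |
       +------------+                       +-------------+
             |                                     |
             |                            +--------+--------+
             |                            |                 |
       +------------+              +-------------+   +------------+
       | Analytical |              | Instruction |   | Monitoring |
       | Techniques |              |    Mixes    |   | Techniques |
       +------------+              +-------------+   +------------+

Figure 3-2  Techniques for Performance Measurement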


Most of these techniques can be utilized for all purposes of performance
measurement, but some provide only marginally useful results. Since a
distributed processing system testbed performance measurement tool is needed
for the purpose of selection evaluation, the following discussion of
performance measurement techniques is confined to those applicable to selection
evaluation.

There are two classes of performance evaluation techniques that can be
used for selection evaluation: modeling techniques and measurement techniques
[Ferrari 78]. Modeling techniques involve building a representation of the
system to be evaluated and then testing that model. Although most useful in
performance projection, modeling techniques can also be used for selection
evaluation. A significant problem with all modeling techniques is
determining how well the model reflects the system it models.

     Validation [of a model] is often difficult, and sometimes
     impossible. It may be based on previous theoretical or simulation
     results, but if the modeled system exists, the ultimate
     foundations of a validation model must be empirical. . . Thus, in a
     sense, measurement is the most important evaluation technique,
     since it is needed also by the other techniques. [Ferrari 78]

Measurement techniques involve actually measuring the behavior of an
existing system and are thus applicable only when the performance of a system
can actually be determined. Several of the measurement techniques
(instruction timings, instruction mixes, and kernel programs) merely make
comparisons of hardware parameters such as memory cycle time, addition times,
etc. These techniques are generally useful only as a supplement to more
powerful techniques when used to compare hardware configurations and are
inadequate when used to compare software systems [Lucas 71].

Hardware and software monitoring techniques, which usually involve the
recording of such things as the number of page faults, number of cache misses,
etc., provide a great deal of information about the performance of a
particular system. But since the parameters that can be measured are usually
very specific to a particular implementation, comparisons between systems with
different internal structures are usually difficult to interpret.

The remaining measurement techniques, generally called benchmark
techniques [Svobodova 76], involve actually running a system using a set of
real or carefully contrived input and measuring the response of the system.
Since the benchmark techniques treat the system under test as a "black box",
measuring only stimuli and responses, they are immune to many of the problems
of other measurement techniques. In general, the only significant difficulties
of benchmark techniques are in the determination of the input to the system
under test and in the analysis of the output of the system under test.

To support the testbed, the performance measurement tool must be capable of
consistently applying arbitrary benchmarks to the machines that are or will be
a part of the testbed. It must also allow arbitrary analysis of the responses
of the testbed equipment. This decision permits a generally useful tool for
the testbed, while not encumbering or presupposing knowledge of the research
issues of either the FDPS project or of benchmark techniques.

3.3 SIMULATORS

3.3.1 Description

Simulators are perhaps the most popular kind of facility used to support
the design of distributed processing systems. They represent the research
technique of first choice (as well as often the last resort) for answering
many questions about the operating characteristics of proposed systems. These
facilities range in scope and complexity from fairly short computer programs
designed to answer specific questions, to quite substantial software systems
capable of addressing a broad range of problems and a variety of operating
environments. In any digital simulation, there is an attempt to model the
salient features of a target system with representations of its current status
and the significant events affecting that status. Simulators are generally
designed to model the interrelationships among many subcomponents of a system
in such a way that their interaction and the effects of various operating
parameters on the overall performance of that system can be examined and
recorded.

Network simulators are tools for modelling network component interactions.
They are essential in the early analysis and design phases of network
development and extremely useful during the maintenance of such a system. The
purpose of these simulators is performance measurement and evaluation of
communication protocols, network control algorithms, network topologies, and many
other operational characteristics. The simulator program executes a sequence
of defined states or events in network activity. Inputs to the program are
various parameters of the model. Output depends on the function of the tool.
The output could be a transaction log of an event sequence, for example, or
performance measures of critical events.
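
The structure just described (model parameters in, an ordered sequence of
events executed, a transaction log and performance measures out) can be
sketched in a few lines. The example below is only an illustration under
stated assumptions: it models a single first-come-first-served link with
exponentially distributed interarrival and service times, and the function
name simulate_link and its parameters are hypothetical, not part of any FDPS
simulator.

```python
import heapq
import random


def simulate_link(arrival_rate, service_rate, num_messages, seed=0):
    """Event-driven simulation of one FIFO link.

    Returns (log, mean_delay), where log is a time-ordered list of
    (time, event_kind, message_id) tuples.
    """
    rng = random.Random(seed)          # seeded so that runs are repeatable
    events = []                        # priority queue ordered by event time
    seq = 0                            # tie-breaker for equal event times
    t = 0.0
    for msg in range(num_messages):    # schedule every arrival in advance
        t += rng.expovariate(arrival_rate)
        heapq.heappush(events, (t, seq, "arrive", msg))
        seq += 1

    link_free_at = 0.0                 # time at which the link next goes idle
    arrived = {}
    log, delays = [], []
    while events:
        now, _, kind, msg = heapq.heappop(events)
        log.append((now, kind, msg))   # transaction log of the event sequence
        if kind == "arrive":
            arrived[msg] = now
            start = max(now, link_free_at)
            link_free_at = start + rng.expovariate(service_rate)
            heapq.heappush(events, (link_free_at, seq, "depart", msg))
            seq += 1
        else:
            delays.append(now - arrived[msg])
    return log, sum(delays) / len(delays)


if __name__ == "__main__":
    log, mean_delay = simulate_link(arrival_rate=5.0, service_rate=10.0,
                                    num_messages=1000, seed=1)
    print(len(log), "events; mean delay", round(mean_delay, 4))
```

The arrival and service rates play the role of the model parameters, the log
corresponds to the transaction log, and the mean delay is one possible
performance measure.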

Typically, the simulation tool models a manageable subset of the overall
distributed environment. A hybrid simulation approach, combining network
simulators with analytic models, is useful in modelling complex environments.
Techniques like the hybrid approach may be the only way to represent in a
simple form the complex interactions of an operational system. The hybrid
technique is often useful in reducing the execution time requirements; however, it
is only applicable when validated analytic expressions are available to
describe the performance of well-defined components of the overall system.

3.3.2 Background

Interest in simulation techniques is both longstanding and widespread.
Their application to computer systems seems particularly appropriate, in that
the  behavior  of  extremely  complex  target  systems  can  generally  be  broken  down 
into  more  manageable  components  which  exhibit  well-defined  state-event 
transitions. 

The  development  of  simulators  is  largely  motivated  by  obvious 
limitations  in  two  prominent  alternatives:  mathematical  analysis  and 
empirical  observation.  Analytic  techniques  are  often  difficult  to  understand, 
requiring  a  fairly  extensive  background  in  applied  mathematics  for  adequate 
comprehension.  This  limits  not  only  the  number  of  active  practitioners,  but 
also  the  population  of  appreciative  and  accepting  readers.  Far  more  serious 
perhaps  are  the  simplifying  assumptions  required  by  many  analytic  models.  It 
may  be  difficult  to  determine  the  extent  to  which  these  assumptions  can  be 
violated without completely invalidating the results of an analysis. When
restrictive assumptions are weakened, an analytic model may become
intractable, yielding only approximations, probabilistic algorithms, or very costly
"brute force" computations. While the validity of applying any analytic or modelling
technique to real systems is always suspect, this problem seems particularly
serious  for  mathematical  analysis. 

Research  data  from  empirical  observations  can  be  very  persuasive  in  any 
scientific  study.  In  asserting  the  merits  of  some  new  technique,  it  is 
particularly important to be able to compare measured performance with similar
data  based  on  observations  of  some  other  approach.  Unfortunately,  operational 
systems  can  be  very  costly  to  develop  strictly  for  research  purposes.  The 
time  required  to  construct  such  a  testbed  may  be  an  even  more  serious 
consideration. 

While simulators are certainly not a substitute for the development of
prototypes, they can play an important role in the design process, before more
substantial resources are committed. Additionally, they can provide an
important research advantage over operational systems in offering greater
control over extraneous variables that affect performance. In an operational
system, it is usually impossible to eliminate or even hold constant all of
these factors, particularly if the costs of research are controlled by using
the testbed for other purposes. Perhaps even more serious are limitations on
the range of factors that are the primary target of investigation. To be able
to generalize one's findings beyond the context of current operating
parameters, it is important to be able to select extreme values for, say,
demand for service or applied load. Various combinations of values may never
occur naturally at a particular installation, or they may occur so rarely that
interesting cases cannot be properly studied.

Modeling is an efficient means of assessing system behavior. In the
FDPS environment, network simulators extend modelling to cover a broad range
of activities. Two major functions of network simulators are:

• Validate analytic models of operation.

• Evaluate performance of network protocols:

•• Components
•• Protocols
•• Interfaces

The assumptions of analytic models are constraining but necessary for adequate
solutions. Message independence is an example of one strong assumption used
in most analytic solutions. Also, most analytic results apply to a steady
state condition, even though it may never exist [RelserSP]. Simulation is
more flexible in describing such events, and the results may confirm the
simplifying assumptions of analytic models.

The probabilistic character of network events and the complexity of
their interactions place heavy demands on analytic models. Advances in
queuing theory have not produced computationally tractable solutions to many
problems. Simulators can evaluate network protocols that are analytically
intractable.

Network simulators are an integral part of communication system support.
The design, test, and maintenance stages of network development rely on
simulation to evaluate various protocols, models of behavior, and computer
communication products. Some of the areas of performance evaluation include:

Low-level communication protocols
Communication access methods
High-speed local area networks
Routing and flow control models
Distributed control models
Workload management algorithms
Transaction processing

As the number of computer networks increases, overall system performance
evaluation  under  different  workloads  will  guide  maintenance  and  expansion 
design  decisions. 

3.3.3 Problems to be Solved

• Major problems to be solved

•• Ability to examine detail at various levels
•• Efficiency of run time
•• Validation
•• Ease of construction
•• Modifiability

It is unfortunate that most simulators are designed as ad hoc facilities that
cannot be readily integrated with or even compared to each other. This
problem can probably be attributed to a very short software life-span, since
simulation programs are often shelved after serving their limited research
purpose. They rarely enter a phase of prolonged utilization that might
justify an additional investment in standardization or even the kind of
documentation that is expected for commercially viable software systems.

Validation of a simulation model is an even more serious problem, since
the lack of it can adversely affect the credibility of any study. The
difficulty lies in the selection of standards or criteria on which to base a
validity assessment. Ideally, one might simulate a few limited cases which
can be verified by agreement with available systems, but simulators are often
developed precisely because the opportunity to test interesting cases does not
exist on available systems. Agreement in the limited cases that do apply may
be better than nothing, but not much, since simulators rarely exhibit
operational uniformity. In fact, they often execute entirely different
procedures to model very different systems. Similar arguments apply to
validation with respect to analytic models. The further the simulator departs
from the domain of these models, the less useful they are as a basis for
validation.

The only satisfactory solution to this problem seems to be validation by
independent simulation. General agreement of two or more simulators designed
separately to study the same problems is an impressive achievement that
reinforces conviction in the results of both. Ironically, it may be desirable in
this context to intentionally limit the transfer of internal documentation for
each simulator, so that subtle artifacts do not cross over and contaminate the
"independent" results.


Unfortunately, independent verification of previous experimental
evidence is seldom seriously undertaken. The excitement and prestige of
breaking fresh ground seem to draw attention away from the important work of
confirming and consolidating previous findings. This, of course, is a general
problem common to many scientific disciplines.

Simulators tend to be enormous programs, so the general problems of
large-scale software development apply to building an efficient simulator. Because
of the large task of designing, documenting, coding, debugging, and testing a
simulator, it may be economically more feasible to actually implement some
network components in an existing testbed environment.


Often the statistical results of certain variables in the simulator are
important to research. Classical statistics requires large numbers of
independent samples to arrive at acceptable confidence levels [Tobagi78].
Generating a large sample size could involve thousands of computer runs of the
simulator.


If the program is long, it may drive the computing cost of a large
sample size extremely high. On the other hand, an experimental design based on a
small  sample  size  does  not  achieve  critical  levels  of  statistical  significance 
and  is  generally  unacceptable. 
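
The trade-off between the number of runs and statistical confidence can be
made concrete with a short calculation. The sketch below is illustrative
only: it assumes the runs are independent replications, uses a normal
approximation (z = 1.96, roughly 95% confidence) rather than an exact
small-sample distribution, and the function names are hypothetical.

```python
import math
import statistics


def confidence_interval(samples, z=1.96):
    """Approximate confidence interval for the mean of independent runs."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def runs_needed(pilot_samples, target_half_width, z=1.96):
    """Estimate how many runs give the requested interval half-width,
    using the standard deviation observed in a small pilot study."""
    s = statistics.stdev(pilot_samples)
    return math.ceil((z * s / target_half_width) ** 2)


if __name__ == "__main__":
    pilot = [10, 12, 11, 13, 9]        # made-up response times from 5 runs
    print(confidence_interval(pilot))
    print(runs_needed(pilot, target_half_width=0.1))   # prints 961
```

Because the half-width shrinks only with the square root of the number of
runs, halving the interval requires four times as many runs, which is how a
modest precision requirement can grow into hundreds or thousands of runs.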


Other specific problems of network simulators are equally important.
The first is establishing the functions of the simulation tool and modular program
specifications. Since the simulator cannot reproduce all the component
interactions of a real system, some aspects of network behavior must be left
out of the model for simplicity.


The second is developing simulation models of networks with a high degree of
autonomy between nodes. The fully distributed environment relies on site
autonomy to achieve many of the design goals. The autonomy complicates the
component interactions to such a degree that current efforts have failed to model
them adequately.


The third is ensuring that the program actually models the system in question
and that the results of exercising the model are valid. When a real network
exists, instrumented performance measures give an objective verification of
simulated results. On the other hand, experimental research often has no
operational system to depend on for that kind of support. Under those
conditions, the validity of the model depends on the internal consistency of the
program, the design logic, and the results of other simulators [Kobayashi78].
Nevertheless, the validity of simulation models is an unresolved issue.

3.3.4 Proposed Solutions

To facilitate the modular development of a simulator, it seems
appropriate to begin with the identification and organization of the design
alternatives relative to the concept of "service", perhaps along the lines
suggested by the ISO Basic Reference Model for Open Systems Interconnection. A
simulator could then be developed to evaluate the effects of these design
alternatives and their interaction on performance goals for Fully Distributed
Processing Systems.

Initial approaches to network simulation in the FDPS environment may
concentrate on the modular design of the program or a collection of
interacting programs to model system behavior. Dividing the overall system into a
framework of independent but interrelated parts allows the design process to
focus on the specific functions of each subsystem. Designing each subsystem
individually and defining the interaction between modules will go a long way
toward creating an overall model that is accurate and flexible.

One advantage of implementing the network simulator as an aggregate of
subsystems is that it permits a hybrid approach. Components
of the system having analytic solutions can utilize those models,
taking advantage of their simplicity. The mathematical results of these modules
are integrated into the descriptive simulation procedures of other subsystems
that are analytically intractable or require more detail.
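
A minimal sketch of such a hybrid arrangement is given below. It is an
illustration under assumptions of our own choosing: the communication link is
assumed to behave as an M/M/1 queue, whose mean time in system is the
standard result 1/(mu - lambda), while host processing is left to a detailed
simulation routine supplied by the caller. The names mm1_mean_delay,
hybrid_response_time, and simulate_host are hypothetical.

```python
import random


def mm1_mean_delay(arrival_rate, service_rate):
    """Analytic module: mean time in system for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("steady state requires arrival_rate < service_rate")
    return 1.0 / (service_rate - arrival_rate)


def hybrid_response_time(num_requests, link_arrival_rate, link_service_rate,
                         simulate_host, seed=0):
    """Mean response time: analytic link delay plus simulated host time.

    simulate_host(rng) stands for the detailed simulation of a subsystem
    that has no convenient analytic solution.
    """
    rng = random.Random(seed)
    link_delay = mm1_mean_delay(link_arrival_rate, link_service_rate)
    total = 0.0
    for _ in range(num_requests):
        # request and reply each cross the link once
        total += 2 * link_delay + simulate_host(rng)
    return total / num_requests


if __name__ == "__main__":
    host = lambda rng: rng.uniform(0.01, 0.03)   # placeholder host model
    print(hybrid_response_time(1000, 8.0, 10.0, host))
```

The analytic module costs one arithmetic operation per request, which is
where the saving in execution time comes from; the price is that the result
is only as good as the assumptions behind the closed-form expression.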

Another advantage is that each subsystem module may run separately from
others  for  more  frequent  trials.  Achieving  a  significant  sample  size  may  be 
possible  by  running  many  smaller  programs  rather  than  one  large  program. 

3.3.5  Belatlonshlp  1ft  iJUlfill  ZSS&  JffiCk  IIUL  aSC»B 

Related work along these lines in the Georgia Tech FDPS Research Program
has prompted the development of several different simulators. Each provides
only a fairly limited view of a complete system, utilizing quite different
assumptions about the underlying interconnection and support structure. This
work needs to be expanded and consolidated into a more comprehensive
evaluation of options available at many different levels of analysis.

An event-based FDPS simulator has been developed which simulates
functions typically provided by local operating systems, functions provided by a
distributed and decentralized control scheme, and the load placed upon the
system by users attached to the system through terminals. This simulator has
been used successfully in two separate research efforts in the FDPS project:
one  analyzing  control  strategies,  and  the  other  analyzing  work  distribution 
schemes. 

Most of the FDPS research relies on network simulators to compare and
contrast different solutions to the unique problems we are confronting.
Recent work at Georgia Tech in distributed processing involves simulation
studies [Martin80], [Enslow81], and [Sharp82]. Simulation is the only method of
evaluating the behavior of distributed network activity in the absence of an
operational testbed. Further research on the operational support capabilities
will need good simulation tools for adequate design and analysis.

3.3.6 Resources and Schedule

Such a project would probably require at least a one-third time
commitment by three or four system analysts/programmers under the active guidance of
two system designers for a period of approximately two years. This time
period would permit evolutionary development, which is recognized as the only
viable approach. Work should begin immediately, since the results of this
study should be quite valuable in guiding other efforts in the development of
an operational distributed processing system.

Considering the development of network simulators as a major software
engineering task, many of the same resources are needed as for implementing an
operating system. Basically the job requires computing services, time, and
people. Computer resources for running simulations and analyzing the results
are also necessary.

To cover a 24 month period:

Manpower                               man-months

Senior Staff (2 at 3 m-m/year)             12
Junior Staff (3 at 4 m-m/year)             24
Programmers (2 at 6 m-m/year)              24
Secretarial Support (3 m-m/year)            6

Equipment

Computer Time                          Very high

Timing

3.4 LOAD EMULATORS

3.4.1 Remote Load Emulators - Short Description

One of the goals of the on-going research in the School of Information
and Computer Science is the creation of a testbed facility for the
implementation and evaluation of fully distributed processing systems (FDPS). An
essential feature of the testbed is the requirement to empirically evaluate
the performance of fully distributed processing systems during their
implementation. A remote load emulator can provide a facility that measures
these systems by generating an external load and measuring the external
response.

A remote load emulator (RLE) is a device that emulates sources of on-line
input to a computer system. An RLE is one of the most reliable tools for
measuring the performance of remote-access computer systems. The general
purpose RLE must emulate both batch input and interactive sources. When the
definition of interactive users is extended to include processes interacting
with one another, we see that "interactive users" are of primary concern to
us. In order to emulate a wide variety of interactive input devices, an RLE
is controlled by programs known as scripts. A script describes a sequence of
actions to be performed by the RLE. Such a sequence might include messages to
be transmitted to the system under test along with their timing, responses
possible from the system under test, and actions to be taken after a specific
response is received. As well as performing actions as specified by the
scripts, the RLE should record all the communication activity for later
analysis.

3.4.2 Remote Load Emulators - Background

Performing a benchmark on a system first involves devising a workload to
apply to the system under test. Svobodova defines the workload of a system as
"the total of resource demands generated by the user community" [Svobodova76].
Seen from the benchmark point of view, devising a workload is simply defining
the set of inputs to be presented to the system under test. It is not a
function of a remote terminal emulator to design the workload to be used as the
benchmark. The user must be responsible for devising a representative
workload based on the system to be tested - the performance measurement tool need
only be able to apply an arbitrary but defined workload.

Once a set of benchmark jobs has been chosen and tested, the workload
can be applied to a particular system configuration. A batch system may be
tested by simply placing each job deck in the card reader at a preappointed
time, and noting the time needed for the completion of all of the jobs.
Testing a slightly different configuration presents no additional problems. The
workload in this case is repeatable; it can be run several times on one system
and, barring malfunctions, one can expect similar results.

Testing of an interactive system is much more difficult. Since an
interactive workload is generated by users entering data at terminals, it is
very difficult to generate a repeatable workload without additional computer
assistance. In general, it is not possible to get a dozen or more people to
type in commands in exactly the same order and "think" for exactly the same
time for many consecutive test sessions. To obtain comparable results from
several test sessions, it is necessary to have a means to emulate the actions
of the interactive users and to repeat the same workload many times without
tiring.

A Remote Load Emulator (hereafter referred to as an RLE) is just such a
device. Its primary function is to emulate the load placed on a system by
remote sources attached through communications links, such as terminals,
sensors, and process controllers. RLEs are quite useful in performance
measurement and evaluation, as well as for emulating devices in multi-dropped
line protocols, monitoring communication line activity, and providing a host
system for the testing of communications line protocols.

When used for performance evaluation, the RLE must produce a predefined
workload while recording data about the responses of the system under test.
To be capable of generating an interactive workload as well as a batch
workload, an RLE must be able to accurately emulate people typing at interactive
terminals. An interactive session, as opposed to a batch job, has three
additional characteristics: 1) future input may be determined by current
output, 2) there may be pauses before input messages corresponding to user "think
time", and 3) there are pauses between input characters corresponding to user
typing rate [Svobodova76].
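
The second and third characteristics can be illustrated with a small sketch
that turns a list of input messages into a timed sequence of keystrokes. It
is offered only as an illustration: the exponential think-time distribution,
the uniform jitter on the typing rate, and the name keystroke_schedule are
assumptions made for this example, and the first characteristic (input that
depends on earlier output) is deliberately left out.

```python
import random


def keystroke_schedule(messages, think_time_mean, chars_per_second, seed=0):
    """Return a list of (send_time, character) pairs for one emulated user."""
    rng = random.Random(seed)          # fixed seed gives a repeatable workload
    clock = 0.0
    schedule = []
    for message in messages:
        # pause before each message: user "think time"
        clock += rng.expovariate(1.0 / think_time_mean)
        for ch in message:
            # pause between characters: user typing rate, with some jitter
            clock += rng.uniform(0.5, 1.5) / chars_per_second
            schedule.append((clock, ch))
    return schedule


if __name__ == "__main__":
    for when, ch in keystroke_schedule(["ls\n", "who\n"],
                                       think_time_mean=2.0,
                                       chars_per_second=5.0,
                                       seed=7):
        print(f"{when:8.3f}  {ch!r}")
```

Drawing all delays from a seeded generator is one simple way to make the
same workload reproducible from one test session to the next.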

For the needs of the distributed processing testbed, a remote load
emulator is the best choice for the performance measurement tool. As a
minimum, the RLE must be able to generate interactive workloads to drive the
existing hardware and software in the testbed. Preferably the RLE should be a
general tool for performing benchmarks; it should be able to emulate any
interactive device, either computer system or terminal, that hardware
considerations allow it to replace.


3.4.3 Remote Load Emulators - Problems to be Solved

From the preceding discussion of the motivations for the RLE, two design
objectives arise: the RLE should produce realistic interactive workloads and
the RLE should remain an effective tool for several years. These objectives,
although succinct, are not absolute requirements. It is necessary, as in most
software projects, to compromise some of the objectives for practical reasons.
For instance, extremely accurate time interval measurement cannot be provided
without hardware modification. Requiring special hardware reduces the
long-term usefulness of the RLE, but increasing its timing accuracy allows the
generated workload to be more representative.


Two requirements are necessary to ensure the RLE's ability to generate
realistic workloads: the RLE must be able to accurately emulate remote
devices, and the workload presented by the RLE must be repeatable
[Watkins77]. These requirements are based on the primary motivation for the project:
some method must be provided to accurately simulate real interactive users.


To be able to accurately emulate remote devices, the RLE must be capable
of three things: it must be able to alter its behavior based on data it
receives from the system under test, it must be able to accurately control
delays between characters, and it must be able to accurately control delays
between a response from the system under test and the next message from the
RLE. These requirements follow directly from the defining characteristics of
interactive workloads mentioned above.


The necessity that the RLE produce a repeatable workload is a direct
result of the purposes for which the RLE will be used. Since it will be used
to compare different hardware and software configurations, it must be capable
of generating the same workload time and again. This is not to say, however,
that given the task of generating the same workload, the RLE will generate
identical output. If the behavior of the system under test differs, of
necessity, the response of the RLE will differ. What must be expected is that "each
time the RLE presents an activity to the SUT [system under test] the observed
performance differences are due to the SUT and not to the RLE" [Watkins77].

The requirements to ensure the long-term effectiveness of the RLE are
perhaps more obvious, since they apply to most software systems as well.
These include ease of use, ease of maintenance, and flexibility. It is clear
that implementation of the RLE will be wasted if use of the RLE requires as
much effort and knowledge as is required to implement a special program to be
used once to perform the same actions.

The RLE will not be useful if it is not easy to maintain (e.g., if it
requires  a  non-standard  environment  with  its  own  special  operating  system  and 
dozens  of  control  files).  Again,  it  will  be  pointless  to  keep  the  RLE  if  it 
requires  more  effort  to  maintain  than  it  does  to  implement  the  special  purpose 
programs  the  RLE  replaces. 

Finally, although the RLE must be easy to use, it must be flexible
enough to perform complex and varied emulation tasks. A priori restrictions
must be avoided that prevent the RLE from performing such tasks as simulating
interactive devices other than user terminals, generating workloads for
machines other than those in the testbed, posing as one or several terminals
on a multi-dropped communications line, passively monitoring activity on a
communications line, or emulating a host system for testing communications
line protocols. The RLE must also be efficient enough to provide a number of
concurrent sessions. Otherwise, the RLE will be of little use in monitoring
even the existing systems.

3.4.4  Remote  Load  Emulators  -  Proposed  Solutions 

It  is  clear  that  the  RLE  must  be  able  to  support  multiple  concurrent 
interactive  sessions,  so  some  concurrency  will  be  required  in  the  RLE.  The 
multi-user  operating  system  supports  multiple  concurrent  processes  and  virtual 
memory, while the single-user operating system does not. There are only two
possible advantages in using the single-user operating system, assuming
multiple processes are simulated to provide the necessary concurrency: code can
be  shared  between  processes,  and  process  switching  time  can  be  minimized. 
These  advantages  are  not  significant  though,  since  most  modern  multi-user 
operating  systems  allow  reentrant  code  to  be  shared  between  processes. 

Since  use  of  the  single-user  operating  system  provides  no  obvious 
benefits  and  because  it  would  noticeably  complicate  the  project  by  requiring 
the  implementation  of  process  scheduling  and  concurrency  primitives,  use  of 
the multi-user operating system is probably the best choice.

Another area for choice is the structure of the RLE itself. There are
three different structures that can be used for the RLE: the RLE can directly
interpret a human-readable script during the emulation session, the RLE can
compile a human-readable script into a machine language program, or the RLE
can compile the human-readable script into an easy-to-interpret intermediate
form for execution. The principal difficulty with the first choice is that it
takes a great deal of time to parse a free-form program. Since the number of
simultaneous interactive sessions that can be run may well be determined by
CPU time requirements, it seems foolish to place the parsing load in the most
time-critical area when better alternatives are available.

The second approach, compiling a script into machine language, answers
the objection to the first approach by allowing a complex script language
while allowing quick execution. It does, however, present two other problems.
First, it does not allow the sharing of code between scripts (except between
identical scripts), since each script would be a separate object program.
Second, it would significantly complicate the implementation to directly
generate machine code, and generating assembly language or Fortran would
inconvenience users by requiring a great deal of time for compiling and
linking the script programs.

The  last  approach,  compiling  scripts  Into  an  Intermediate  form, 
minimizes  the  deflclenoes  In  both  of  the  previous  two  approaches.  It  permits 
a  complex  source  language,  while  permitting  efficient  Interpretation.  It  also 
allows  the  Interprete  code  to  be  shared  among  the  concurrent  processes  and  Is 
much  easier  to  Implement  and  maintain.  It  Is  this  approach  that  was  used. 
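
The division of labor between preprocessor and interpreter can be
illustrated with a minimal sketch. The script syntax (SEND, WAIT, MARK), the
opcode numbering, and the function names below are assumptions made for
illustration only; they are not the actual RLE script language or
intermediate form.

```python
# Hypothetical script syntax: one command per line (SEND <text>, WAIT <seconds>,
# MARK <label>). Neither the syntax nor the opcodes come from the RLE itself.
OPCODES = {"SEND": 0, "WAIT": 1, "MARK": 2}

def compile_script(source):
    """Parse the human-readable script once into (opcode, operand) pairs."""
    program = []
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        verb, _, operand = line.partition(" ")
        program.append((OPCODES[verb.upper()], operand.strip()))
    return program

def interpret(program, session):
    """Execute the compiled form; no text parsing happens during emulation."""
    for opcode, operand in program:
        if opcode == 0:
            session.append(("send", operand))
        elif opcode == 1:
            session.append(("wait", float(operand)))
        else:
            session.append(("mark", operand))
    return session
```

Because the compiled program is plain data, the same interpreter code can
serve every concurrent emulated session, which is the sharing property
argued for above.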

A difficult area to address is the analysis to be done on the output
from an RLE test session. Little is known about what information will be
required in the analysis of a test session, since many of the projects that
might use the RLE have not been devised. Because of this, it is necessary to
defer the decisions on the exact kinds of analysis that can be performed.
Fortunately, there is an approach which allows this quite simply. The RLE
time-stamps and records all input and output from interactive sessions during
emulation. Instructions are written in the script to place various markers in
this log along with the session transcription. Then, after the emulation
session is complete, these logs can be analyzed. Since events of interest to
the investigator have been tagged by markers in the log, time intervals can be
easily computed, and other information can be derived as needed. This
approach has the benefits that the analysis code is not built into the RLE and
can thus be changed without danger to the integrity of the RLE code, and since
a complete record of the emulation session is made, analyses may be run and
rerun on the same session without the need of repeating the expensive
emulation session.
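
As a rough illustration of such after-the-fact analysis, the following
sketch computes the elapsed time between pairs of markers in a recorded log.
The log record layout (timestamp, kind, text) and the marker names are
assumptions chosen for the example, not the RLE log format.

```python
def intervals(log, start_marker, end_marker):
    """Return elapsed times between each start marker and the next end marker."""
    results, started = [], None
    for timestamp, kind, text in log:
        if kind != "marker":
            continue  # ordinary input/output records are ignored here
        if text == start_marker:
            started = timestamp
        elif text == end_marker and started is not None:
            results.append(timestamp - started)
            started = None
    return results

# A hypothetical time-stamped session log with script-placed markers.
log = [
    (0.0, "marker", "login-start"),
    (0.1, "output", "login: "),
    (0.4, "input", "user1"),
    (1.5, "marker", "login-done"),
    (5.0, "marker", "login-start"),
    (5.75, "marker", "login-done"),
]
```

Since the analyzer only reads the log, a different analysis can be run later
on the same recorded session without repeating the emulation.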


As discussed above, RLE contains three components: the preprocessor,
the interpreter, and the analyzer. A diagram of the structure of the RLE
appears in Figure 3-3.

3.4.5  12  ijj&bfic  ms.  Mark  aaS  SSCLa.

The development of a suitable RLE(s) will greatly enhance the value and
usability of a testbed. It is not essential; however, its value clearly
outweighs its costs.

To  generalize  the  use  of  the  RLE  it  must  be  able  to  emulate  embedded 
processes  interacting  with  one  another.  The  ability  to  add  this  capability 
should  be  considered  when  designing  the  script  defining  language  and 
preprocessor. 

3.4.6 Resources and Schedule

To cover a 9-month period:

Manpower                           Man-months

Senior Staff (1 at 1/4 time)       2.25
Junior Staff (1 at 1/4 time)       2.25
Programmers (1 at full-time)       9

Equipment

Computer - Very high for development
           Dedicated systems probably required to execute the RLE.

3.5 MONITORS

"Monitors" are utilized to obtain information from the target system
itself during its execution.

A number of different types of monitors should be available. Some of
these are:

Performance  Monitors 
File  Utilization  Monitors 
Network  Activity  Monitors 
Execution  Monitors 

Discussed below is the execution monitor.

3.5.1  Execution  Monitors 

It will be necessary to provide programmers with the proper programming
tools if they are to be able to make effective use of a fully distributed
processing system. The development of PRONET is an initial step in that
direction, providing programming language support for the design and
construction of distributed programs. The work proposed here is intended to
continue this development by providing a tool for monitoring the execution of
distributed programs. It will involve close interaction with other
researchers participating in the Fully Distributed Processing Systems Research
Program, particularly those working on the design and implementation of a
distributed operating system.

3.5.2 Background

In a conventional programming environment, there are two principal
purposes for monitoring the run-time behavior of a program: performance
measurement and debugging. (By "monitoring" we refer to some mechanism for
obtaining information about the performance of a program, external to the
program itself.) Performance measurement is a relatively mundane application
of monitoring in such an environment, being principally concerned with the
processor time requirements of various parts of a program and requiring little
or no interactive intervention by a programmer. Debugging is considerably
more interesting, requiring extensive programmer interaction by its very
nature. Even so, as pointed out by Plattner and Nievergelt in a recent survey
[Plat81], relatively little work on debugging has been reported in the
literature.




Most of the debugging tools in use today are based on concepts developed
in the 60's. For instance, commonly cited papers on debugging by Evans and
Darley [Evan66], Ferguson and Berner [Ferg63] and Balzer [Balz69] were all
published before 1970. Many debugging tools provide access to a running
program only at the machine language level. For example, a recent paper by
Fairley [Fair79] reported on a tool for specifying breakpoints by assertions
in assembly language programs. More sophisticated tools do allow a programmer
to debug his program by interacting with it in terms of high-level language
features such as variables, complex data structures and complex statement
types (for example, Pierce [Pier74], Satterthwaite [Satt72] and Myers
[Myer80]), but such tools are not commonly available. (It should be noted
that just such a high-level view is specified for the Minimal Ada Programming
Support Environment [DODR80].) Sophisticated debuggers are typically
customized for a particular language, though debuggers for several languages
can be built based upon a single framework, with specialized information about
each language incorporated as is necessary. A debugger allowing such
high-level interactions is likely to be an important part of any useful program
development environment on an FDPS.

When we generalize our thinking to an FDPS from a traditional
single-processor environment, the uses of monitoring become somewhat different
and we must develop a new conceptual view of a major part of the monitoring
task. We are, of course, still interested in performance measurement and
debugging, but these tasks become quite different in this new environment. The
reason for this difference is that we are now concerned with distributed
programs - programs which cannot be monitored by considering a single address
space on a single machine. Rather, we must now be concerned with the
communication between the various parts of a program, for these interactions
will play a crucial part in our monitoring task.

3.5.3 Problems to be Solved

Performance measurement in an FDPS is made more complex by a number of
new considerations. Use of processor time is no longer the main performance
criterion. Communication costs and the overall time it takes to execute a
program, which is affected by the potential for parallel execution of subtasks
and by time spent waiting for messages, are equally important considerations
in many situations. Further, it is much more difficult for a measurement
program to monitor an entire program, since the monitored program may be
distributed arbitrarily across a network of machines. It will obviously be
necessary for any monitoring program to interact with the distributed
operating system of an FDPS in order to obtain the necessary information about
the distribution of a program and about its communication linkage and behavior.

This need to obtain information from distributed execution sites
naturally applies to debuggers as well as to performance monitors. In fact,
it is a more complex problem in the case of a debugger since the debugger must
somehow assist a programmer in comprehending the "state" of a program which
consists of a number of processes running asynchronously on several machines.
Conventional debugging tools are certainly of little use in this situation,
since they are typically oriented toward monitoring the operation of what
would only be a single process of a distributed program. Once again, tools
which interact with the distributed operating system in order to provide
information about the status of process interactions will be required. (Such
tools should also have the capability to interface with more traditional
monitoring tools which can be used on the individual processes.)

Just as communication should play an important part in distributed
performance measurement, it should also have a crucial role in debugging
distributed programs. The correctness of such programs will undoubtedly
depend on the correctness of the contents and sequencing of messages
transmitted between their constituent processes. Thus a distributed debugging
tool must deal with communication as a major part of its job. In fact, it is
conceivable that a communication monitor may be the debugger at the
interprocess level, complementing traditional debuggers which operate on
individual processes.

As a final difficulty, any kind of monitoring of a distributed program
will potentially generate a great deal of information, which must be conveyed
to a programmer in a comprehensible manner. It will presumably not be
satisfactory to produce all of this information independently for each of the
processes. Rather, the information must be aggregated in some manner
consistent with the nature of the monitoring task being performed.

3.5.4 Proposed Solutions

The network descriptors of PRONET will provide an excellent basis for
the operation of distributed monitoring tools. The interconnection
information these networks provide is exactly what is required by a monitor so
that it can easily recognize the structure of an entire program. Thus the
implementation of a distributed performance monitor or debugger can use our
PRONET work as its basis.

As was indicated in the previous section, a communication monitor will
be a crucial part of any of these tools. The interconnection specifications
in PRONET networks, as currently designed, provide the minimum amount of
information needed by a communication monitor. That is, they provide a
listing of the message paths between processes and the types of the messages
which may be transmitted. The task of a monitor will be to provide a programmer
with information about message transmission between processes. For
performance measurement purposes, the most important information will probably
involve such factors as message queue lengths and the amount of time processes
spend waiting for messages. A distributed debugger, on the other hand, will
be concerned with the sequencing of messages and with their contents. It will
probably also be required to provide some capabilities to examine the
operation of individual processes, which may be accomplished by interfacing
with traditional single process debuggers.
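
The performance measures named above can be derived from a stream of
communication events, as the following minimal sketch suggests. The event
kinds ("enqueue", "block", "receive") and the record layout are hypothetical;
in practice they would depend on what the distributed operating system is able
to report to the monitor.

```python
from collections import defaultdict

def summarize(events):
    """events: (time, kind, process) with kind 'enqueue', 'block' or 'receive'."""
    queue = defaultdict(int)       # messages currently queued for each process
    max_queue = defaultdict(int)   # longest queue observed for each process
    blocked_at = {}                # time at which a process began waiting
    waiting = defaultdict(float)   # total time spent waiting for messages
    for time, kind, proc in events:
        if kind == "enqueue":
            queue[proc] += 1
            max_queue[proc] = max(max_queue[proc], queue[proc])
        elif kind == "block":
            blocked_at[proc] = time
        elif kind == "receive":
            queue[proc] -= 1
            if proc in blocked_at:
                waiting[proc] += time - blocked_at.pop(proc)
    return dict(max_queue), dict(waiting)
```

A debugger would consume the same event stream but retain the message
contents and their order rather than reducing them to totals.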

There seem to be two useful approaches to the problem of handling the
large amount of information collected by monitoring a distributed program:
graphical display and automated processing of the information by the debugger.
The graphical display approach would be most useful for showing the
connections between processes, message queue lengths, the flow of messages, etc.
Automated processing might involve such things as automatically checking for
proper sequencing of messages. Extensions to the networks of PRONET to allow
specification of message sequencing information would be required to make such
checking possible.
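
One possible form of automated sequence checking is sketched below, assuming
the sequencing information for a message path is given as a small state
machine. This representation is our own illustrative assumption; PRONET as
currently designed has no such specification.

```python
def check_sequence(spec, start, messages):
    """Return the index of the first message violating the spec, or None."""
    state = start
    for index, message_type in enumerate(messages):
        next_state = spec.get((state, message_type))
        if next_state is None:
            return index  # this message type is not allowed in this state
        state = next_state
    return None

# Hypothetical sequencing rule for one path: requests and replies alternate.
spec = {
    ("idle", "request"): "busy",
    ("busy", "reply"): "idle",
}
```

The checker reports only where the observed traffic first departs from the
specification, which keeps the amount of information presented to the
programmer small.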

3.5.5 Relationship to Other FDPS Work

Another major project in the Georgia Tech FDPS project is the
development of an operating system for managing the resources of an FDPS.
Preliminary work in this area has been reported in [Ensl81]. The availability
of a distributed program monitoring tool should prove to be quite useful in
the development and tuning of a distributed operating system (DOS). While it
has been proposed that the monitor must take advantage of some operating
system functions, basing the monitor on some primitive DOS capabilities while
developing a full DOS should certainly be feasible. In fact, since the basic
job of a DOS is to make decisions about the distribution and scheduling of
programs, evaluation of its performance will be impossible without a
monitoring tool. Thus this capability needed in the course of DOS development
provides an immediate motivation for the implementation of a distributed
program debugger, which will be useful to all FDPS programmers.

3.5.6 Resources and Schedule

To cover a 24 month period:

Manpower                                         man-months

Senior Faculty (0 m-m/year)                      0
Junior Faculty (3 m-m/year)                      6
Technicians (0 m-m/year)                         0
Secretarial Support (1 m-m/year)                 2
Student Assistants (2 students at 6 m-m/year)    24

Equipment

Computer Time                                    Substantial

Timing

First period of 12 months:

Port current PRONET implementation to Perqs;
design and implement communication monitor.

Last period of 12 months:

Experiment with user interfaces to debugger,
using the previously developed monitor;
interface with process-level debugger.

3.6 TESTBEDS FOR DISTRIBUTED SYSTEMS

3.6.1 Description

Due to the complexity of the environment, it is extremely difficult to
evaluate, by analysis or simulation, the effectiveness of many algorithms and
heuristics proposed for distributed systems. Evaluation is made even more
difficult because many algorithms (e.g., scheduling algorithms) adapt the very
environment in which they exist.

3.6.2 

3.6.2.1 Rationale for Testbed Development

• Realtime testing of distributed systems is a major obstacle to
their development

• Use of a testbed may be the only viable alternative

• Obtaining real-time performance data may be significantly
facilitated by the availability of a flexible and
well-instrumented distributed testbed

3.6.2.2 Objectives in Testbed Development

• Develop a facility which will allow the advancement of
distributed computing technology

• Provide the capability for rapid evaluation of architectural
concepts

• Further the technology of developing high speed, high
performance distributed testbeds

• Standardization and integration of distributed processing
technology efforts

3.6.3 Approach

One approach that has been found effective in evaluating algorithms for
distributed systems is the use of a testbed which shares many of the
characteristics of the environment in which the algorithms will ultimately be
used. Such testbeds can be oriented toward collection of various statistics,
making possible very close monitoring of the behavior of algorithms. They are
particularly suited to evaluation of algorithms for concurrency control,
scheduling, load distribution, distributed resource allocation, and
distributed data bases.

We propose construction of such a testbed, using between 5 and 10
machines in the 0.5-1.0 MIP range, 0.5 MB main store, and Winchester disks.
While the proposed machines would be considerably smaller than the ultimate
systems in which the tested algorithms would be used, the machines would
differ mainly in capacity, with execution speeds and communications speeds
being close to those of the ultimate machines. Details of the research
proposed under this category will appear in a later version of this document.


3.6.4 Resources

A distributed system's testbed should contain at least five
separate computing systems. These should be homogeneous systems, if not
identical. If the target system is to be heterogeneous, then the
heterogeneous components are in addition to these.

3.7 DESIGNER WORKBENCHES

The concept of using the computer to aid in the writing of computer
programs is not new. What is new is the concept of a set of wnf-.nnahftri
"tools" integrated into a designer's workbench. Obviously, several different
types of workbenches will be of value. The most critical need is probably in
the area of distributed data base design.

3.7.1 Distributed Database Designers' Workbench

3.7.1.1 Description

The complexities of distributed data base design far exceed the
limitations of manual procedures. Evaluating the performance as well as the
reliability characteristics of alternative DDB configurations requires
extensive automated support.

3.7.1.2 Background

Principal work done thus far in this area has been at the University of
Michigan. K.B. Irani is working in the area of "Modeling and Design of
Distributed Databases and Communication Networks" and Toby J. Teorey is
working with James P. Fry on "A Generalized Facility for Database Application
Design."

3.7.1.3 Resources

This group is not familiar enough with this area to provide meaningful
estimates.

SECTION 4

OPERATIONAL SUPPORT CAPABILITIES

4.1 INTRODUCTION

In this section, we discuss research into capabilities required for
operational support for fully distributed processing systems. Such
capabilities may be manifested in a production system as distinct program
modules, collections of modules, algorithms used within the system, or
fundamental components of the base architecture of the system. In this section
we discuss file systems, command languages, load management, interprocess
communication, communication protocols, and the requirements of local operating
systems for support of guest, or meta, operating systems. Clearly, further
operational support capabilities are required to implement a fully functioning
distributed system. However, we consider that the capabilities above are
fundamental to operational support of fully distributed processing systems,
because they must be addressed early in the process of designing a system.

We propose a two-fold approach to the study of these capabilities.
Recognizing the gains in understanding that accrue from experience in actually
building systems, we suggest the construction of two testbed operating
systems; one testbed will take the "meta system" approach, and the other will
be a "native" operating system. These systems are strictly vehicles for
research, however, and deviations from the task of actually designing and
building them will be encouraged, in order to study issues which arise in
designing their support capabilities.

The resource estimates accompanying each capability description in this
section represent estimates that apply if that capability is to be studied
independently of the construction of a testbed. Because the various
capabilities are so inter-related, and because a testbed will allow study of
more than one capability, the overall estimates for testbed construction and
study of individual capabilities are less than the sum of the resources
required to study each capability independently. These overall estimates are
contained in the summary to this section.

4.2 DISTRIBUTED FILE MLL Mil MANAGEMENT SYSTEMS

4.2.1 Description

File systems are an integral part of current operating systems and
appear to be fairly well understood in a single machine context. However, in
distributed surroundings, file systems are much more complicated and must
support new functions, such as replication. Studying and building a
comprehensive distributed transaction-based file system which supports versions,
replication, concurrency control and recovery has interesting research
aspects. More importantly, such a file system would be quite important in
practice. Providing such features as intrinsic replication support and
uniform access to data seems mandatory to attain many of the well publicized
goals of distributed processing.


4.2.2  Background 

Recently there has been growing interest in the problems associated with
distributed file systems. There appear to be two basic varieties of file
systems under analysis: server machine based and cooperating file system
based. This distinction is perhaps artificial in that it may appear that the
jia£ dictates whether a system is server or cooperating based. Perhaps
another way of viewing the situation is to evaluate the responsibility
each client has in interacting with the storage system. That is, does the
client have to know where (e.g., which server) the data is physically located,
or does the client communicate with the local system, which then proceeds to
locate and retrieve the data?


The following recent systems could be classified as server based:

WFS [Swin 79]
DFS [Stur 80]
CFS [Dion 80]
Felix [Frid 81]
Swallow [Svob 81]

The following, however, could be considered cooperating file system
based:

Locus [Pope 81]
Domain [Leac 82]

Which approach is reasonable depends most likely on the hardware
environment (size of local storage, distribution of hardware, etc.). Both
schemes are very reasonable; we, however, will consider the cooperating file
system approach because it appears to be more general, will probably be used
often in the future, and is best suited to an FDPS. Unfortunately, it is more
complex. This results from the handshaking required between the file systems.

We consider the following environment. First, a client transaction
requests some information (perhaps via operations on abstract objects) from a
(probably local) file system A. This file system then locates the objects, and
the destination file system, called B, synchronizes access to them. Once the
transaction completes (or aborts), the appropriate changes made to the objects
are made permanent. Throughout the transaction's life, system failures have
no permanent effect (with high probability [Lamp 76]) on the referenced
objects.

A detailed discussion of some of the problems encountered with
distributed file systems, particularly naming, is presented in Appendix A.

4.2.3 Proposed Research

The need for some means to access data in a network is obvious.
However, to achieve two of the principal advantages of distributed systems,
efficiency and availability, there are many problems which must be solved.
Clearly, simply linking two file systems together will not suffice to achieve
either efficiency or availability. For example, extending Unix path names to
include the node name in the path does not address the problems of
replication, transparency, or multiple file (on different nodes) commit.

The  problem  is  to  build  a  distributed  file  system  supporting  each  of  the 
following: 

•  replication 

•  uniform  naming 

•  version  support 

•  transaction based (atomic oriented)

•  "standard" concurrency control (1 writer, multiple readers)

The next step is to include support for

•  general objects (with operations other than read and write)

•  concurrency control based on specification. Thus
serializability would not be the only correctness condition.

Each of the above file system aspects is discussed below. Even though
security is an integral part of a file system, this research will not address
this topic specifically; however, there is a research area analyzing these
protection issues.

4.2.3.1 Replication

To take advantage of the potential for enhanced reliability that
distributed systems offer, it is desirable to be able to redundantly store
objects at more than one node. If the logical object is immutable (i.e.,
never changes), the problem is quite simple. For mutable objects, however,
updates must be coordinated so that all clients see a consistent state. There
are general (and complex) solutions (e.g., [Stur 80]); however, simpler
schemes such as [Leac 82] may be better.

In addition there appear to be fundamental differences in the
requirements placed on replicated data. One type could be classified as
amorphous, where the object (file) can be altered even during multiple node
failures. The other type, primary copy, distinguishes one copy which
coordinates updates. Further there is the question concerning whether all copies
must be updated automatically or whether a converging approach is satisfactory.

The tradeoffs in cost (and complexity) of solving these questions are
numerous. It is clear, however, that supporting general replication which is
exceedingly expensive to use (because it is so general) has little merit. A
more acceptable approach is to construct schemes for maintenance of replicated
data which provide only weak consistency. That is, the copies of the data
need not hold the same values at all times. A set of such algorithms which
guarantee central consistency (in the absence of further changes) is presented
in Appendix I.
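
To make the notion of weak consistency concrete, the sketch below lets
copies diverge temporarily and converge once updates stop. It uses a simple
"latest stamp wins" rule, which is our substitution for illustration and is
not one of the algorithms of Appendix I; the class and function names are
likewise hypothetical.

```python
class Replica:
    """One copy of a replicated object, ordered by a (time, writer) stamp."""

    def __init__(self, name):
        self.name = name
        self.stamp = (0, "")   # (logical time, writer name) orders all updates
        self.value = None

    def write(self, value):
        # A local update; other copies are NOT informed immediately.
        self.stamp = (self.stamp[0] + 1, self.name)
        self.value = value

    def merge(self, other):
        """Adopt the other copy if it carries a later stamp."""
        if other.stamp > self.stamp:
            self.stamp, self.value = other.stamp, other.value

def exchange(a, b):
    """Pairwise reconciliation; repeated exchanges make all copies agree."""
    a.merge(b)
    b.merge(a)
```

Note that under this rule one of two concurrent updates is silently
discarded, which is exactly the kind of cost and complexity tradeoff
referred to above.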

4.2.3.2 Uniform Naming

In the cooperating file server approach it seems paramount to be able to
hide the location of objects. Note that transparency should not, of course,
be made mandatory. To have uniform naming requires that naming information
may give hints as to a file's location, but cannot be absolute (an oracle).
Above the unique identifier level, the system must provide user level
character names. There are interesting problems in this area as well; however,
they will not be considered. Many different schemes could be layered above the
unique identifier level.
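
The distinction between a hint and an absolute location can be sketched
as follows. The data structures (a table of hints and a map from node names to
the identifiers each node holds) and the exhaustive fallback search are
assumptions made to keep the example small; a real system would use a less
costly search.

```python
def locate(uid, hints, nodes):
    """Return the node holding uid; the hint is tried first but never trusted."""
    hinted = hints.get(uid)
    if hinted is not None and uid in nodes.get(hinted, set()):
        return hinted
    for node, objects in nodes.items():   # fall back to asking every node
        if uid in objects:
            hints[uid] = node             # refresh the stale or missing hint
            return node
    return None
```

Because a wrong hint costs only extra work and never a wrong answer, objects
can move between nodes without invalidating the names that clients hold.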

There are many choices when considering uniform naming. See [Leac 82]
and [Shoe 78] for good discussions of these issues.

4.2.3.3 Version Support

With the advent of laser disc storage modules, versions have received
much more attention recently [Svob 81]. However, the ability to manage
multiple versions is very desirable even on current hardware. There are many
reasons to require version support. Two of these are software development and
possibly higher concurrency. The software development reason is fairly
obvious; see [Reed 78] for a discussion of the higher concurrency aspects.

4.2.3.4 Transaction Based

Users need control over what is a recoverable unit and what is a
"consistent" view of changes made to files. Most systems do not provide this
database approach to file storage, but it seems critical in distributed
systems where failures are assumed independent. Thus, support for safe commit
for multiple files on multiple nodes is required. Most commit algorithms are
very expensive and some users may prefer not to be penalized for the
additional safety. Thus using transactions for recovery reasons must be client
controllable.
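
A multiple-node commit of the kind required can be sketched with the
familiar two-phase pattern: every node first stages its updates, and changes
become permanent only if all nodes succeed. This is offered as an assumed
illustration, not as the scheme of Appendix J; the Node class and its methods
are hypothetical, and failures during the second phase are ignored.

```python
class Node:
    """A file system node that can stage updates before making them permanent."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.files = {}      # committed file contents
        self.pending = {}    # updates staged by prepare()

    def prepare(self, updates):
        if not self.healthy:
            return False     # this node cannot promise to commit
        self.pending = dict(updates)
        return True

    def commit(self):
        self.files.update(self.pending)
        self.pending = {}

    def abort(self):
        self.pending = {}

def commit_all(plan):
    """plan maps each node to its file updates; apply all of them or none."""
    prepared = []
    for node, updates in plan.items():
        if not node.prepare(updates):
            for p in prepared:
                p.abort()    # undo the staging done so far
            return False
        prepared.append(node)
    for node in prepared:
        node.commit()
    return True
```

The extra round of messages in the first phase is the expense mentioned
above; a client that does not need the additional safety could be allowed to
bypass it.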

It does not seem reasonable to include support for nested transactions
because of the simple nature of most operations on the files. However, if
true object support is included the use of nested transactions must be
reviewed. An initial specification for such a scheme is discussed in Appendix
J.

4.2.3.5 "Standard" Concurrency Control

Most single machine file systems support a concurrency paradigm which is
"single writer and multiple readers." This probably suffices for many
applications in distributed systems too. However, in general this seems
quite crude, since in any aggregate object (like a file), many updates could
occur without interfering with each other. Thus "standard" concurrency seems
only tentatively acceptable. This issue is elaborated in Appendix J.
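
For reference, the "standard" paradigm amounts to the following rule,
shown here as a minimal non-blocking sketch (the class and method names are
our own, and a refused request simply returns False rather than waiting).

```python
class FileLock:
    """Single writer or multiple readers on one file."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def acquire_read(self):
        if self.writer:
            return False         # a writer excludes all readers
        self.readers += 1
        return True

    def acquire_write(self):
        if self.writer or self.readers:
            return False         # a writer needs exclusive access
        self.writer = True
        return True

    def release_read(self):
        self.readers -= 1

    def release_write(self):
        self.writer = False
```

The crudeness noted above is visible here: the lock covers the whole file, so
two writers are refused even when they would update unrelated parts of it.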

4.2.3.6 General Object Support

Files can be viewed as simply instances of abstract data types with the
operations of (say) open, close, read, and write. It seems quite reasonable
to support the storage of general objects such as message ports, process
location tables, etc., through the same basic file system mechanism. The file
system then becomes a general object management system. In addition to making
the system more general, higher concurrency is made possible by the system
using the semantic knowledge of the operations on the objects. Further, the
memory / disk data structure difference can then be ignored by the client.
Some work has been done in this area (e.g., [Poll 81]; also, Appendices H and
J); however, the issues of higher concurrency have not been addressed.

4.2.3.7 Specification Based Concurrency

Most database work defines serializability as the means to define
correctness of a concurrency control system. This is not always reasonable.
Consider a program reading the directory of some file system. In most cases,
whether the result is serializable or not is unimportant; most users do not
care whether the directory list is perfect (just that it could have been in
that state at some time). This is just one example of the desire to support
concurrency based on a specification which is placed with each object. The
specification would define how the operations may be interleaved.
Serializability could easily be specified, but many distributed applications
do not require perfect serialization (e.g., naming servers and mail systems)
and through the specification could weaken the correctness condition.
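
One possible form for such a per-object specification is a table of operation pairs that are allowed to overlap. The following is only a sketch under that assumption; the operation names, the two example specifications, and the function `may_start` are hypothetical. The empty specification permits no overlap at all, which is stricter than serializability but trivially yields a serial order.

```python
# Each object carries a specification: the set of operation pairs that
# may be in progress at the same time (order within a pair is ignored).
DIRECTORY_SPEC = frozenset({
    ("list", "list"),
    ("list", "insert"),   # a listing may overlap an insertion
    ("list", "delete"),   # ... or a deletion
})

# No pair may overlap: operations run strictly one at a time.
SERIAL_SPEC = frozenset()


def may_start(spec, active_ops, new_op):
    """Return True if new_op may be interleaved with every active operation."""
    return all(
        (active, new_op) in spec or (new_op, active) in spec
        for active in active_ops
    )


# A directory listing is admitted while an insert is in progress,
# but two updates are still kept apart.
listing_allowed = may_start(DIRECTORY_SPEC, ["insert"], "list")
update_allowed = may_start(DIRECTORY_SPEC, ["insert"], "delete")
```

Weakening the correctness condition for an object then amounts to adding pairs to its specification.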

4.2.4 Relationship to FDPS Work

This support capability is driven by the requirements placed on a data
storage system in a distributed system. It is operational in nature, using the
"best" technology available for single machine file systems and extending this
model as the needs dictate for a distributed environment. It encompasses
transaction research dealing with concurrency and recovery, resource
replication, and version maintenance in addition to the "usual" file system
problems, such as naming.

We consider the proposed research to be fundamental to distributed
systems. In view of its high priority, we have begun conducting research in
the area (Appendices H and J).

4.2.5 Resources and Schedule

To cover a 24 month period:

Manpower                 man-months

Senior Staff                  4
(2 m-m/year)

Junior Staff                 12
(6 m-m/year)

Programmers                  30
(3 at 5 m-m/year)

Secretarial Support
(3 m-m/year)

Equipment

Computer Time            Substantial

Timing

First period of 12 months:

Analysis and design of data management capability.

Last period of 12 months:

Implementation and evaluation of prototype system
that supports file system capabilities.

4.3 INTERPROCESS COMMUNICATION

The capability of a process to communicate with another on a remote node
is one of the key functions a distributed processing system must support in
order to attain many of the benefits claimed for distributed systems. One
particular type of IPC mechanism, namely message passing, has been the focus
of recent research and development because it encourages modularity and
autonomy of processes. The basic functional issues of message-based IPC have
been resolved, although there is and will be a continuing search for the
"best" set of message passing primitives. Efficiency problems and the support
for producing reliable distributed programs are among the major problems yet
to be solved. The issue of remote procedure call (RPC) as a paradigm for
message passing is currently a controversial area, although it is not elaborated
here.

4.3.1 Background

The major advantages distributed processing systems often claim include:
(1) unified access to remote resources, (2) performance improvement by parallel
operations, and (3) fault tolerance through redundant resources. These
advantages can be obtained by close cooperation between processes residing on
separate nodes of the distributed processing system. Thus, the capability of
a process to communicate with another on a remote node is one of the key
functions a distributed processing system must support. Although a number of IPC
mechanisms for distributed systems have been identified [ENSL79], one
particular type, namely message passing, has been the focus of recent research
and development. While the essential functional equivalence of message
passing and other mechanisms is generally acknowledged (e.g., [LAUE78]), some
arguments have been made in favor of message passing from the software
engineering point of view [MANN80, GENT81, STAN82]. The major advantages of
message passing can be summarized in two points: modularity and
autonomy of processes. With a message-based IPC mechanism, processes can be
written to run entirely within private address spaces, disjoint from the
address spaces of other processes. This modularity property enhances software
understandability and maintenance. Process autonomy is derived from the
generality of control flow supported by message passing mechanisms, which do
not impose any hierarchy among processes. These two points, i.e., modularity
and autonomy, are particularly important in fully distributed processing
systems (FDPS), which require a high degree of autonomy among processes which
cooperate independently of their location.

A large amount of work on IPC by
message passing has been reported. A number of systems with message-based IPC
facilities have been designed and/or built [AKKO74, BRIN70, CHER79, CHER81,
CROW81, GILO81, HERT78, JACQ78, KAMI78, KAIN80, KRAM81, MAEK77, LI E79,
RASH81, ROWE82, STIE79, TEST79, WALD72, WULF81]. High level programming
languages suitable for message-based IPC have also been designed and/or implemented
[AMBL77, ANDR81, BRIN78, COOK80, DOD80, FELD79, HOAR78, INGA78, KESS81, LI81,
LISK79, MAO80, MAY78, SILB81a, SILB81b, VAN81]. As part of the research
program in Fully Distributed Processing Systems at Georgia Tech, a study on
the characterization of message-based IPC facilities has been done [FUKU82],
and a distributed programming language, called PRONET, has been developed
[LEBL81, MACC82].

In general, the design of a message-based IPC facility must
address the following basic, functional aspects of message passing: (1) how
to identify the processes involved in a communication, (2) how the actual
message transmission is carried out, (3) how the process synchronization can be
controlled, (4) how a process can wait for and select the next message to be
received, and (5) how the tools to cope with failure of communication are
provided. These functional issues are essentially solved and well understood,
although the way they are solved varies from one system to another.
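
To make the five functional aspects concrete, the following sketch shows one of many possible sets of choices, confined to a single machine. It is not the interface of any system cited above; the names `Port`, `send`, `receive`, and `ReceiveTimeout` are assumptions made for illustration. Processes are identified by named ports (1), transmission is a queue insertion (2), the send is asynchronous and the receive blocks (3), the receiver may select messages with a predicate (4), and a timeout reports failure to obtain a message (5).

```python
import collections
import threading


class ReceiveTimeout(Exception):
    """Raised when no acceptable message arrives in time (aspect 5)."""


class Port:
    def __init__(self, name):
        self.name = name                       # (1) identification by port name
        self._messages = collections.deque()   # (2) transmission = enqueue
        self._cond = threading.Condition()

    def send(self, sender, body):
        # (3) asynchronous, no-wait send: the sender never blocks.
        with self._cond:
            self._messages.append((sender, body))
            self._cond.notify_all()

    def _find(self, accept):
        # Index of the oldest message satisfying `accept`, or None.
        for index, (sender, body) in enumerate(self._messages):
            if accept(sender, body):
                return index
        return None

    def receive(self, accept=lambda sender, body: True, timeout=1.0):
        # (4) selective receive; (3) the receiver blocks until a match exists.
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(accept) is not None, timeout)
            if not found:
                raise ReceiveTimeout(self.name)
            index = self._find(accept)
            message = self._messages[index]
            del self._messages[index]
            return message


printer = Port("printer")
printer.send("editor", "print chapter 4")
printer.send("operator", "pause")
urgent = printer.receive(accept=lambda sender, body: sender == "operator")
```

Other systems make different choices for each aspect (for example, blocking sends or direct process naming), which is why the primitives vary from one system to another.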

4.3.2 Problems to be Addressed

As mentioned in the previous section, solutions to the functional
problems basic to message-based IPC have been found. However, there is no
consensus on the "best" set of message passing primitives whose semantics are
easy to understand, efficient to implement, not error prone, yet powerful
enough to allow and even encourage parallel operations. The search for such a
set of message passing primitives will continue as a main engineering
issue of the distributed IPC design. There are two language-related issues in
designing the functions of a distributed IPC facility. The IPC functions
should be at a proper level for implementing high level programming language
primitives. These functions must not be too complex (e.g., guaranteeing
reliable transmission when it is not necessary) nor must they be so simple
that implementation of high level language primitives for IPC using them is
difficult or inefficient. On the other hand, it is desirable that the IPC
functions be flexible and general enough to support multiple programming
languages with different concepts of interprocess communication. There is very
little knowledge about such interactions between the designs of the IPC
functions and distributed programming languages.

One of the important remaining issues of the distributed IPC design is
related to efficiency problems. The slowness of message-based systems seems
to be a common complaint [WULF81, CROW81]. We need to develop both general
and specific techniques to minimize the overhead of message passing. Another
efficiency-related problem of a message-based IPC facility is how to take
advantage of hardware capabilities. Some types of communication subsystems
provide a capability of broadcasting a message to multiple nodes very
efficiently. The IPC facility must provide the user with a concept of "single
source - multiple destination" communication and implement a mechanism of
effectively utilizing the broadcasting capability to deliver a message to
multiple processes, some of which may reside on the same node.

The last problem we discuss here concerns the support required to
produce reliable distributed programs. Since debugging a distributed program
is more complex and difficult than debugging a non-distributed program, it is highly
desirable that the IPC facility recognize a user-specified "protocol" (or
characteristics of conversation) among processes and check at run time
whether the conversation conforms to the protocol. Therefore, we have to develop a
protocol specification language as well as an efficient run-time checking
mechanism.
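
As a rough illustration of run-time conformance checking, a protocol can be written as a finite-state machine over (sender, receiver, message kind) events, with every observed message checked against it. This is only a sketch under that assumption and does not represent any particular specification language; `ConversationChecker`, `ProtocolViolation`, and the request/reply protocol below are hypothetical.

```python
class ProtocolViolation(Exception):
    """Raised when an observed message is not allowed by the protocol."""


class ConversationChecker:
    def __init__(self, transitions, start):
        # transitions maps (state, sender, receiver, kind) -> next state
        self.transitions = transitions
        self.state = start

    def observe(self, sender, receiver, kind):
        key = (self.state, sender, receiver, kind)
        if key not in self.transitions:
            raise ProtocolViolation(
                f"{sender} -> {receiver}: {kind!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]


# User-specified protocol: strict alternation of request and reply.
REQUEST_REPLY = {
    ("idle", "client", "server", "request"): "waiting",
    ("waiting", "server", "client", "reply"): "idle",
}

checker = ConversationChecker(REQUEST_REPLY, start="idle")
checker.observe("client", "server", "request")
checker.observe("server", "client", "reply")
state_after_exchange = checker.state
```

In an IPC facility the `observe` step would sit on the message delivery path, so its cost per message is the efficiency concern raised above.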

4.3.3 Solutions and Initial Approaches

Concerning the search for the "best" set of functions to be provided
by the IPC facility, we have to gain more experience of the performance of
various message-based IPC facilities as well as experience in building various
types of distributed programs. With respect to the efficient implementation
of a message-based IPC facility, the recent paper by Spector [SPEC82] shows an
interesting and encouraging approach. Spector's experiment shows that simple
remote operations can be executed quickly (about 150 microseconds including
the transmission time on Xerox Alto computers using the 2.94 megabit Ethernet,
which is about two orders of magnitude faster than could be expected if they
were implemented in a conventional way). This improvement in performance is
made possible by the specialization of the communication interface, the use of
a simplified protocol, and the direct implementation in microcode. An
appropriate initial approach to the problem of "protocol" specification and
run-time checking of conversations among processes can be found in [LIVE80].
A language, called the "Task Graph Language", allows the specification of
connectivity among processes, message sequencing, concurrency, and mutual
exclusion. The constraints specified by a task graph are then enforced at run
time.

In the research we propose, the fundamental issue to be examined is
that of deciding which characteristics of interprocess communication
mechanisms are best suited to fully distributed processing systems. For this
purpose, we propose that the testbeds be constructed with a view to evaluation
of different interprocess communication protocols. This is not an easy task,
since interprocess communication is usually an integral component of an
operating system. Once such a testbed has been constructed, however, it can
be used to evaluate numerous protocols by experimentation and analysis of
difficulty of use and overhead incurred.

4.3.4 Relationship to Other FDPS Work and OSC's

Since distributed IPC is one of the key capabilities in distributed
processing systems, its design requires close cooperation with the design of
other operational support capabilities, particularly distributed access
control, distributed monitoring, distributed file management, distributed
recovery management, and communication protocols. The functions provided by a
distributed IPC facility also affect the design of some software support
tools, particularly program design languages, distributed programming
languages, and interactive monitors.

4.3.5 Resources and Schedule

To cover a 24 month period:

Manpower                 man-months

Senior Staff                  4
(2 m-m/year)

Junior Staff                 12
(6 m-m/year)

Programmers                  30
(3 at 5 m-m/year)

Secretarial Support           6
(3 m-m/year)

Equipment

Computer Time            Several work stations

Timing

First period of 12 months:

Build simulator for distributed IPC.

Last period of 12 months:

Conduct experiments in distributed IPC.

4.4 COMMAND LANGUAGES

4.4.1 Description

The command language used in a fully distributed processing system is a
critical component, as it represents the FDPS to the user. Similar to the
relationship between a programming language and a compiler or between a
command language and an operating system, a distributed command language may be
considered as being implemented by an FDPS. The term "user" is meant to be
general, describing whatever is at the end-points of the FDPS. Examples of
users include application programs as well as people at various levels of use
such as application users, application designers and implementors, and system
implementors. A command language includes both the commands, which request
action of an underlying system, and the responses which are returned by the
system, indicating the status of the requested action. The command language
may be seen by a user as a programming interface, or as a series of messages
exchanged on a terminal. Commands may either specify requirements, allowing
the underlying system to determine how these are to be met, or may be
procedural, specifying how the action is to be carried out.

4.4.2 Background

Historically,  command  languages  have  been  developed  in  order  to  provide 
the  capabilities  of  operating  systems  to  users.  The  design  has  typically  been 
structured  around  these  operating  system  capabilities,  giving  the  user  a 
somewhat  abstract  view  of  the  operating  system.  As  this  view  is  defined  by 
the operating system rather than user requirements, the user is faced with a
"semantic gap" which is filled by becoming familiar with aspects of the
machine  which  are  not  related  to  the  user's  task.  This  can  be  seen  in  the 
proliferation  of  unique  command  languages  available  for  various  machines, 
which  require  a  user  to  learn  more  details  of  a  particular  machine  than  should 
be  necessary  to  get  the  task  completed. 

Command languages for computer networks have followed a similar trend.
For example, level models such as the OSI model are developed in a bottom-up
order, with attention being paid last to the top levels, where user command
languages are defined [Hertweck80]. More dialogue is needed between network
and command language designers, such as the relation which exists between
compiler writers and language designers. This should take place before any
standardization in order to avoid standardizing outdated techniques, such as batch
punched-card workstations.

A distributed command language for an FDPS would be classified as a
"network operating system command and response language", or NOSCRL.
Standardization activities are taking place in several groups, ANSI
[FramptonMellorSchlegel80], British Computer Society [Newman80b], and CODASYL [Harris80],
though little is being done in the areas of research [Hertweck80].


4.4.2.1 Options for Common Command Languages

There exists a wide variety of command languages and design philosophies
for them; i.e., some consider them to be simply a job control language, some
consider them to be a collection of tools, and some insist that a command
language should be as powerful as any programming language. How should a
command language or possibly multiple command languages fit into a fully
distributed processing system? The alternatives for incorporating command
languages into distributed systems (but not necessarily an FDPS, because some
of the options do not meet the defined characteristics of an FDPS) are:

1. Allow only one command language in the entire network.

2. Allow for one common command language that all nodes in the
   network must understand. A given node may provide other
   command languages but it must first provide a 1-1 translation
   between its local command language and the common network
   command language. Users may access the network from any node
   using either the common network command language or one of the
   available local command languages.

3. Do not provide a common network command language but allow any
   command language in the network provided that there are
   translators written from that command language to every other
   command language in the network.

4. Do not provide for a complete common network command language,
   but define a subset of network commands with which any command
   language in the network must have 1-1 correspondence. For
   example there should be a one to one correspondence between
   file copy commands and mail and message sending facilities.

5. Do not provide for a network command language nor insist that
   translators be written between different command languages.


4.4.2.2 Load-Based Command Languages

Since the distributed command language contains information from the
entire set of concurrent processes making up a user's job, it can be used to
convey much more information to the local operating system than simply which
processes are to run concurrently, and which processes communicate with whom.

We simply list here some of the information which can potentially be
added to command language statements in order to supply extra information to
the operating system in order to aid in scheduling and work distribution
decisions:

Inter-process Communication Mechanism
Volume of Inter-process Communication
Command Location
Program Size
Concurrency and Sequentiality

4.4.3 Problems to be Addressed

The definition and implementation of a distributed command language have
many problems associated with them. Some are general cases of problems faced by
an  operating  system  command  language,  such  as  specification  of  resource 
requirements,  while  others  are  unique  to  the  environment  of  an  FDPS.  Taking 
the  view  that  the  distributed  command  language  should  be  user-oriented,  the 
problems  described  here  are  mainly  those  from  a  user's  point  of  view. 

Visibility of Network. A friendly network does not interfere with
users, but provides services to meet users' requirements. The network should
logically be considered as a passive communications medium, and should be as
transparent to the user as possible.

Richness vs. Simplicity. A command language which expresses all the
capabilities of a system is powerful, but most likely difficult to use. A
simple command language, while easy to use, may not have the ability to handle
complex requests.

Tailored Common Language. As mentioned previously, a standard
NOSCRL is being worked on by several groups, where ideally a user would need
only the common language to specify requirements, allowing portability among
processors in a network or between networks. On the other hand, user
requirements vary, for example between the application package user and the
application designer. Therefore a single standard communications language may
be too complex to adequately serve the range of users, requiring either
different languages or different levels of access within the language.

Location of Command Language. A command language can be implemented
either within one or more existing programming languages, or separately as a
language by itself. The main argument for inclusion within programming
languages is that users need to know only the programming language, but many
users don't want to know any programming language. The question also arises
as to which programming languages should have the command language embedded.
A command language may be considered as a means of specifying the environment
in which a program is to run, in which case it would be used separately from
application programs.

Command Syntax. The syntax of a command language has an important
effect on its usability. One main difference in style is the use of either
keyword parameters or positional parameters. Keyword parameters are
self-documenting, but take longer to enter. Use of menus is similar. Some
approaches allow either, such as procedure calls in Ada.
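
The trade-off between the two parameter styles can be seen in any notation that accepts both. The sketch below uses Python function calls only as a stand-in for command entry; the command `copy_file` and its parameters are hypothetical, and nothing is actually copied.

```python
def copy_file(source, destination, replace=False):
    """Build a (hypothetical) copy request; no file is touched."""
    return {"source": source, "destination": destination, "replace": replace}


# Positional style: short to enter, but the meaning of each argument
# (and of the bare True) depends on remembering the parameter order.
positional = copy_file("report.txt", "backup.txt", True)

# Keyword style: longer to enter, self-documenting, order-independent.
keyword = copy_file(destination="backup.txt", source="report.txt", replace=True)

# Mixed style: positional for the obvious arguments, keyword for the rest.
mixed = copy_file("report.txt", "backup.txt", replace=True)
```

All three forms describe the same request, so a command language that allows either style leaves the choice between brevity and readability to the user.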

Human Factors. Under this category fall the various aspects of
man-machine interfaces which make the distributed command language easier to use
from a human user's standpoint, as well as within applications which are
prepared by humans. New terminal types, such as those with intelligence or
graphics capabilities, open new possibilities for making machines easier to
use.

Compilation Translation. A command language could either be
translated, where commands are acted on immediately upon entry, or compiled,
which allows the system to have all the user's requests available to perhaps
make better decisions on such things as resource allocation. Also, in what
form should command procedures be kept: compiled or in the original source?
An extensive discussion of this topic is presented in Appendix B.

Interfaces. It is becoming fairly widely accepted that there are
several different "levels" of user interfaces required in a system. (For a
recent discussion see Roger W. Ehrich and H. Rex Hartson, "On Effective
Software Development Methodology," CACM, Vol. 25, No. 5, May, 1982, pp. 350-351.)
It is not yet clear exactly what user levels of control will be
required; however, the need for at least the following appears obvious for
distributed systems:

• System Programmer

• System Manager

• Maintenance Personnel
•• Hardware
•• Software

• Applications Systems Programmers

• "Occasional" Programmers

• Non-Programming Users

4.4.4 Proposed Solutions

Several approaches to the definition of a distributed command language
will be described. These are not disjoint, but describe major concepts which
address the problems.

Translation Between Systems [EraylUngerVeller75]. A common command
language (NOSCRL) is defined, but used only for communication between
different operating systems. The user sees a system with which he is
familiar, and enters commands in its format. The commands are translated into
the command language(s) of the system(s) on which the user's task is to be
performed, in two stages. First, the commands are translated into the common,
or intermediate, language by the system whose host language was used. Then,
the commands in the common language are translated on the destination
system(s) into the local command language, and executed. Each system requires
only two "half-translators", to translate between the local command language
and the common command language.
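
The two-stage scheme and its effect on the number of translators can be sketched as follows. The command vocabularies for the two systems "A" and "B" and the common-language verbs are invented for illustration; only the structure (local to common, then common to local) follows the description above. With n systems, direct translation between every ordered pair needs n(n-1) translators, while the common-language scheme needs 2n half-translators.

```python
# Hypothetical half-translators: local verb -> common-language verb.
TO_COMMON = {
    "A": {"COPY": "copy-file", "TYPE": "show-file"},
    "B": {"cp": "copy-file", "cat": "show-file"},
}

# The reverse half-translators: common-language verb -> local verb.
FROM_COMMON = {
    system: {common: local for local, common in table.items()}
    for system, table in TO_COMMON.items()
}


def translate(command, source, target):
    """Translate a command in two stages through the common language."""
    verb, *arguments = command.split()
    common_verb = TO_COMMON[source][verb]          # stage 1, on the source system
    local_verb = FROM_COMMON[target][common_verb]  # stage 2, on the destination
    return " ".join([local_verb, *arguments])


def translators_needed(n):
    """Translator counts for n systems under the two schemes."""
    return {"pairwise": n * (n - 1), "via_common": 2 * n}


on_b = translate("COPY notes backup", source="A", target="B")
counts = translators_needed(10)
```

For ten systems this is 20 half-translators instead of 90 direct translators, and adding an eleventh system requires two new half-translators rather than twenty.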

Translation Into Systems [Dakin75], [Newman75], [Newman80a]. Commands
from users are entered in a common command language (NOSCRL), and then are
translated into the command language of the target system. They may be
translated on the local system, or on the target system where they are
executed.

Message-Oriented Model [LauerNeedham79]. A system built on the
message-oriented model is comprised of processes which pass messages among themselves,
as opposed to the procedure-oriented model, in which processes move between
contexts. The message-oriented model is closer in structure to a distributed
system than the procedure-oriented model, though the latter can be provided
through "communicating variables" [Hertweck80]. The message-oriented model
allows the general definition of a user as a process which autonomously
communicates, and may be a program or person at a terminal. Commands from a user
are messages directed at processes which manage resources. Such a command
language designed around message protocols could have a single specification,
regardless of whether a command originated from a terminal or a program.

Virtual Protocols have been used successfully to interface system
components, especially in the area of communication systems. Based on a virtual
model, such as a file or terminal, commands are defined to carry out its
operation.

Abstract Machines [Unger80], [Kugler80], [Hopper80]. The objective of
this approach is to "define a communication interface between a computer
system and all its users, which enables every user to solve his problems on a
semantic level appropriate to his problems (and not necessarily to those of
the computer system)." [Unger80]. Looking at a computing system from a
user's perspective, the user sees an operating system command language, which
is implemented by the operating system. A taxonomy of universal operating
system facilities can be established, and a given operating system can be
considered as implementing a subset of these facilities. A user's
requirements do not depend on which machine is to be used. However, the
operating system command language varies considerably from one machine to
another, forcing the user to understand how his requirements can be met
through the subset of the universal operating system facilities available
through the unique command language on a particular system.

The approach taken by use of an abstract machine (AM) is to define the
operating system facilities in a consistent manner as a "basic abstract
machine" (BAM), which is implemented in a layer above the operating system.
The facilities of the BAM are then the basis for the definition of several
AMs, each tailored for a particular user. The AMs are portable as the BAM is
a standard, regardless of which machine it is implemented on (though
presumably all machines do not provide all resources).

In  a  network  environment,  this  concept  is  applied  to  the  definition  of 
"virtual  network  machines",  which  provide  resources  of  one  or  more  real 
machines  to  users.  This  approach  allows  portability  of  a  user's  tailored  com¬ 
mand  language  among  network  machines,  but  does  not  require  all  users  to  follow 
the  same  command  language  as  does  a  standard  NOSCRL. 

4.4.5 Initial Approaches

Research into this capability can be conducted analytically, or
experimentally in conjunction with either or both of the two testbeds.
Problems encountered in the guest system testbed will be more driven by the
state of current technology, while the native testbed will offer an
opportunity to work in a less constrained environment.

The figures below are for a pilot project to investigate the
requirements of distributed command languages analytically at first, with a
'simulated' environment to be constructed in the second year to evaluate
results. More extensive research can be done on this topic in conjunction
with testbed construction.

4.4.6 Resources and Schedule

To cover a 24 month period:

Manpower                 man-months

Senior Staff                  4
(2 m-m/year)

Junior Staff                 12
(6 m-m/year)

Programmers                  30
(3 at 5 m-m/year)

Secretarial Support           6
(3 m-m/year)

Equipment

Computer Time            Substantial

Timing

First period of 12 months:

Analysis and design of command language capabilities.

Last period of 12 months:

Construction and evaluation of simulator for distributed
command languages.

4.5 LOAD MANAGEMENT

4.5.1 Local Scheduling

Local scheduling involves deciding when to assign resources (e.g.,
physical memory, processor) to eligible processes so that goals of response
time and throughput are met. This section is concerned only with the
resources available at the local site. The section on work distribution discusses
other alternatives.

4.5.1.1 Background

There are two basic types of scheduling: deterministic and
probabilistic. Deterministic scheduling is only possible when the processing
time for each process or task is known beforehand. A significant body of work
has been done in this area. Coffman and Denning [Coffman & Denning, 73]
provide a good introduction. Much of the work has been done for multiple
processors, and thus may be applicable to distributed systems where processing
times are known a priori. However, this is usually not the case. When
processing times are not known beforehand, probabilistic scheduling is used.
Probabilistic scheduling has many heuristic characteristics. In "classical"
systems, it involves techniques such as round-robin and priority queueing
disciplines. Several schemes have been proposed for the special environment
of distributed systems. One such is the concept of coscheduling in Medusa
[Ousterhout, et al., 80]. A task force (i.e., a set of cooperating processes)
is said to be coscheduled if all of its runnable processes are simultaneously
scheduled for execution on their respective processors. Thus, most
interprocess communication can proceed immediately, since the communicating
processes are both currently running. (Note that this assumes relatively
short delays for communications.)
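
As a rough illustration of the definition above, the following Python sketch
checks whether a task force is coscheduled in a given time slice. The data
layout (a mapping from processor to the process it is running) and all names
are assumptions made for this example; they are not taken from Medusa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    task_force: str
    runnable: bool = True


def is_coscheduled(task_force, processes, running):
    """Return True if every runnable process of task_force is running now.

    processes: all processes in the system (hypothetical flat list).
    running:   mapping of processor id -> Process currently executing there.
    """
    executing = set(running.values())
    members = [p for p in processes
               if p.task_force == task_force and p.runnable]
    return all(p in executing for p in members)


procs = [
    Process("p1", "A"),
    Process("p2", "A"),
    Process("p3", "A", runnable=False),  # blocked, so it need not be running
    Process("q1", "B"),
]
slice_1 = {"cpu0": procs[0], "cpu1": procs[1]}  # both runnable members of A
slice_2 = {"cpu0": procs[0], "cpu1": procs[3]}  # only one member of A runs

print(is_coscheduled("A", procs, slice_1))  # True
print(is_coscheduled("A", procs, slice_2))  # False
```

In the second slice, a message from p1 to p2 would have to wait until p2 is
next scheduled, which is the delay coscheduling is meant to avoid.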

Another approach is the wave scheduling technique used in MICROS
[vanTilborg & Wittie, 81]. This involves structuring processes into trees,
where each level of the tree consists of managers for the level below.
Scheduling is done hierarchically, with each level of managers scheduling the
level below. A "wave" propagates down the tree with each high-level
scheduling decision.
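
The hierarchical idea can be sketched as follows. This is a simplified
illustration under our own assumptions (a fixed tree, a request for a number
of idle worker processors, first-fit division among children); it is not the
MICROS algorithm, and the class and method names are hypothetical.

```python
class Node:
    """A manager (has children) or a worker processor (leaf)."""

    def __init__(self, name, children=None, busy=False):
        self.name = name
        self.children = children or []
        self.busy = busy

    def free_workers(self):
        """Count idle leaf processors in this subtree."""
        if not self.children:
            return 0 if self.busy else 1
        return sum(c.free_workers() for c in self.children)

    def schedule(self, needed):
        """Claim up to `needed` idle workers below this node.

        Each manager only decides how many workers to ask of each child;
        the children make the next level of decisions, so the request
        moves down the tree like a wave. Returns the worker names claimed.
        """
        if not self.children:
            if needed > 0 and not self.busy:
                self.busy = True
                return [self.name]
            return []
        claimed = []
        for child in self.children:
            share = min(needed - len(claimed), child.free_workers())
            if share > 0:
                claimed += child.schedule(share)
        return claimed


root = Node("root", [
    Node("m1", [Node("w1"), Node("w2", busy=True)]),
    Node("m2", [Node("w3"), Node("w4")]),
])
print(root.schedule(3))  # ['w1', 'w3', 'w4']
```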

4.5.1.2 Problems and Initial Approaches

Neither of the approaches was specifically designed for distributed
systems. To evaluate their utility in this environment, a possible direction
for research would be to monitor processes' requests for activation, seeking
relationships between groups of processes. Such relationships would then be
used to develop and evaluate new scheduling algorithms.

4.5.2 Work Distribution

4.5.2.1 Description

Work distribution for FDPS's involves assignment of resources (e.g.,
files, devices, processors) so that goals of system utilization, response
time, and throughput are met. This problem has long been studied in the
context of centralized systems. In the case of an FDPS, however, the nature
of the environment (e.g., time delays in communication, possible failures,
autonomy, security, etc.) makes the problem much more difficult. Other
problems closely related to workload distribution are: process and file
migration, node autonomy, and decentralized control (decision making).

4.5.2.2 Background

In order to make the best use of the multiplicity of resources available
in an FDPS, there must be some coherent policy set forth and enforced. In a
situation where each site has all the resources it will ever need, work
distribution may not be necessary. However, if this is the case, then it is
most likely that each site will not always be using all of its local
resources. Some form of work distribution is necessary in order to utilize
these idle resources. Previous work in this area can be broken down into two
categories: placement and assignment [Jones & Schwarz, 80] [Sharp, 82]. The
placement problem involves the physical placement of resources (i.e., files)
in the network. Allocation of processes to processors constitutes the
assignment problem. The placement problem has received the most attention
[Buckles & Hardin, 79] [Casey, 72] [Chang & Liu, 79] [Chen & Akoka, 80]
[Chu, 69] [Chu, 73] [Irani & Khabbaz, 79] [Levin & Morgan, 75]. The
approaches range from optimal graph theoretic solutions (of limited
applicability) to heuristic algorithms and simulation. The assignment problem
has received somewhat less attention. Most of the work [Rao, et al., 79]
[Stone, 77] [Stone, 78] [Stone & Bokhari, 78] has been of a graph theoretic
nature, and the algorithms quickly become computationally intractable when
extended to even a moderate number of processors. The general assignment
problem has been shown to be NP-complete [Kratzer & Hammerstrom, 80]. Casey
and Shelness [Casey & Shelness, 77] have proposed a heuristic that shows
promise. Simulations [Sharp, 82] have also been done.
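
To make the contrast between exact and heuristic assignment concrete, the
sketch below compares an exhaustive search with a simple greedy heuristic.
The cost model (heaviest processor load plus traffic between processes placed
on different processors) and the greedy rule are assumptions chosen for
illustration only; this is not the heuristic of [Casey & Shelness, 77] nor
any of the cited graph theoretic algorithms.

```python
from itertools import product


def cost(assign, load, traffic):
    """Cost = heaviest processor load + traffic between split processes."""
    per_cpu = {}
    for proc, cpu in assign.items():
        per_cpu[cpu] = per_cpu.get(cpu, 0) + load[proc]
    comm = sum(t for (a, b), t in traffic.items() if assign[a] != assign[b])
    return max(per_cpu.values()) + comm


def exhaustive(procs, cpus, load, traffic):
    """Try all len(cpus) ** len(procs) assignments; viable only when tiny."""
    best = min(product(cpus, repeat=len(procs)),
               key=lambda c: cost(dict(zip(procs, c)), load, traffic))
    return dict(zip(procs, best))


def greedy(procs, cpus, load, traffic):
    """Place processes heaviest first, each where the cost rises least."""
    assign = {}
    for p in sorted(procs, key=lambda q: -load[q]):
        def trial(cpu):
            partial = {**assign, p: cpu}
            # Only count traffic between processes that are already placed.
            seen = {k: t for k, t in traffic.items()
                    if k[0] in partial and k[1] in partial}
            return cost(partial, load, seen)
        assign[p] = min(cpus, key=trial)
    return assign


procs = ["a", "b", "c", "d"]
cpus = [0, 1]
load = {"a": 4, "b": 3, "c": 2, "d": 1}
traffic = {("a", "b"): 5, ("c", "d"): 1}

print(greedy(procs, cpus, load, traffic))  # {'a': 0, 'b': 0, 'c': 1, 'd': 1}
print(cost(exhaustive(procs, cpus, load, traffic), load, traffic))  # 7
```

On this small instance the greedy result happens to match the optimum. The
exhaustive search examines 2^4 = 16 assignments here, but m^n in general for
n processes and m processors, which is why it cannot be used beyond very
small systems; the greedy rule carries no optimality guarantee.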

4.5.2.3 Problems

A problem requiring study is transparent process migration; i.e., how to
relocate a process such that the process is unaware that it has been moved.
There is a large amount of state information associated with an active process
that must be maintained consistently during the transport. Also, if the
process is communicating with other processes, the time spent in migration can
cause other processes to timeout (considering the migrating process to have
failed) unless precautions are taken. Possibly, a more profitable approach
might be to consider migrating "transactions" as units of work, rather than
entire processes.

A second, more fundamental problem is that of the decision apparatus.
The decision to distribute load can be made by a logically centralized
"workload controller", or by one of a number of nodes if a decentralized
scheme is used. Equally important is what information is used in making the
decision, and how that information is maintained. Many issues in this area
are discussed by Jensen in [Jensen81].

4.5.3 Initial Approaches

Perhaps the most important things to know when designing a workload
distribution mechanism (or in deciding if one is indeed necessary) are the
characteristics of the workload expected for the system in question. A
distribution scheme that works well in an interactive software development
environment may be completely inadequate for a real-time command-and-control
system. Also, a scheme that can handle both environments may be too slow to
be useful to either. Thus, the workload characteristics, together with the
purposes and goals of the system at hand, will greatly impact the design of
the workload distributor.

Modeling and simulation can be used to achieve this characterization,
but the best method is probably direct measurements from an existing system
that implements the same (or similar) functions. Extrapolation can then be
made to include any enhanced functionality to be provided by the new system
(here, modelling and simulation are necessary).

The computational intractability of the distribution problem requires
the use of heuristics in any practical system. The only way to evaluate these
heuristic algorithms is through the use of simulations or experiments. Some
of the more promising work [Casey & Shelness, 77] has used this approach.

Simulation as a technique for evaluating algorithms in distributed
systems is limited in the extent to which it can capture the volatility and
dynamic nature of the system, and the extent that it can detect unexpected
transient effects which might be vital to a working system. Therefore, we
propose incorporating research into load management into the construction of a
testbed, to take place after the initial testbed is constructed. If no
testbed is to be constructed, we propose that a hardware configuration,
similar to that of an FDPS, be constructed. Upon this hardware, a truly
distributed simulation system can be built. Such a system will capture the
effects of line transmission delays and internal queueing in the nodes.

4.5.4 Relationship to Other FDPS Work

Load management is part of the issue of general resource management. As
such, it is associated closely with command languages. A relationship is also
seen with data management, since the information upon which resource
allocation decisions are made is distributed in nature.

4.5.5 Resources and Schedule

To cover a 24 month period:

    Manpower                                 man-months
      Senior Staff (2 m-m/year)                   4
      Junior Staff (6 m-m/year)                  12
      Programmers (3 at 5 m-m/year)              30
      Secretarial Support (3 m-m/year)            6

    Equipment
      Computer Time                          A loosely-coupled multiple
                                             processor testbed

    Timing
      First period of 12 months:
        Build distributed simulator.
      Last period of 12 months:
        Conduct experiments in distributed load management.
4.6 META SYSTEMS

A distributed operating system (DOS) is a set of software capabilities
which manage the resources of a distributed processing system. The DOS
requires support in the form of local operating systems on the various nodes
in the system. The implementation of such an operating system can proceed in
two ways: the local operating systems which will provide the support for the
DOS may be designed from scratch, with the functionality required for the DOS
in mind, or the DOS may be implemented as a layer above already existing
operating systems.

A DOS implemented with the latter approach is called a guest system or
meta-system. The local operating systems used by guest systems were not
necessarily designed to support anything other than local access of resources.

This section describes some of the problems encountered by the guest
system approach, particularly by those systems based on heterogeneous host
systems. Some approaches taken to solve these problems are also discussed.

4.6.1 Background

The agency responsible for providing FDPS users with services is the
distributed operating system (DOS). A DOS differs from a traditional
operating system in that its fundamental concern is not with the sharing or
multiplexing of resources ([PEEB80], [KIMB76], [FORS80], and [WATS80]).
Rather, the DOS makes services available to users, and establishes global
policies concerning the use of these services. For example, if there is a
class of services providing essentially the same function, the DOS decides
which of this set a user is allocated.

The DOS is also responsible for locating services for the user. Users
of the system should be able to ask for services by logical names.

Because an FDPS is composed of several processors, programs written for
these systems may take advantage of the parallelism available. Such programs
would be composed of modules which communicate by passing messages. The DOS
is responsible for providing inter-process communication (IPC). IPC should
appear the same, regardless of whether the processes involved are running on
the same processor or on different processors.

The DOS is also responsible for distributed process management. This
involves the creation and destruction of processes at a global level. It may
also be necessary for the DOS to be able to block processes.

The DOS must provide for the protection of resources from incorrect or
unauthorized usage, similar to the service provided by traditional operating
systems. The service provided by the DOS may be more complicated, in that a
user may not have access to a service that resides on a particular host, but
may be allowed to use a similar service on a different host. For example, if
a host on the system is being used for developmental purposes, access to its
local resources may be restricted.

In order to provide these functions, the DOS must rely on local
operating systems present on each of the host machines in the system. It is
these local operating systems which will provide the traditional operating
system functions (memory management, scheduling, and so forth) and manage the
local resources of a machine. In order for the FDPS user to make use of the
services, the DOS must request the service from one of the local operating
systems.

4.6.2 Guest Systems

The local operating systems for the host machines may be designed from
scratch with the express purpose of supporting the DOS. The DOS is then
implemented as part of these systems. This is called the base level approach
[THOM78]. This approach allows the functions of the DOS to be considered at
the local operating system level. The resulting system can be very efficient,
since the host operating systems and the DOS are designed to mesh together
into a cohesive system. Indeed, the prime advantage of this approach is the
possibility of integrating the functions of the DOS and host operating systems
to some degree.

The main handicap to this approach is the cost of development. Not only
must the code for distributed functions be written, but the code necessary to
handle the traditional operating system services must also be written. The
meta-system [THOM78] or guest system approach avoids this drawback by using
existing operating systems as the host systems. Using the meta-system
approach, the DOS becomes a layer of software that runs on top of the local
operating systems. It is essentially an application program which transforms
requests for distributed services into the appropriate requests for services
that the local operating system provides.

One advantage of the guest system approach was already mentioned. Indeed,
much early research into distributed systems assumed the guest system approach
for this very reason ([KIMB76], [FORS77], [MILL77]). Another advantage of
taking this approach, however, is that most existing systems have
considerable investment in application software which would become useless if
the underlying operating system were thrown away. If the purpose of designing
the distributed system is to allow users access to a wide range of such
software, then the guest system approach would seem more advantageous.

The NSW, for example, was designed to allow users access to the wide
range of services which exist on the various hosts in the system ([GELL77] and
[MILL77]). The system was designed to run on Tenex, Multics, and OS/360
systems. NSW was intended to allow users at various locations in the system
to share software development tools.

ADAPT [PEEB80] is a guest system which is intended to run on VAX/VMS.
ADAPT is an object model system. ADAPT sees resources as typed objects that
can be operated upon by a limited number of functions. ADAPT attempts to use
existing software as much as possible, so it does not take the object model to
its full extreme, using relatively large structures, such as files, as the
limits of granularity.

Desperanto is a guest system designed to run on a variety of systems
[MAMR82a]. The system views distributed software as a set of modules. A
module is a set of data objects and a set of functions which can manipulate
the objects.

4.6.3 Research Problems

The basic problem faced by a guest DOS is the translation of local
services into FDPS services. The most common solution to this problem is to
require that each host in the system support some sort of monitoring process
which is responsible for requesting services from the host system ([MAMR82a],
[PEEB80], [FORS78]). This monitoring process is the interface between the
FDPS user and the underlying host system. Since the monitoring process is
running at what is normally the application level of the host system, there
may be problems performing the required services. The monitoring process must
be able to start and stop processes for FDPS users. This may require more
access than application programs normally have. Also, the monitoring process
must be able to access services for a remote request. This may require
extended access or the ability to start up a process for that user on the
host system.

A problem faced by a DOS in a heterogeneous system is that of providing
a uniform interface for FDPS users. With the base-level approach, this
problem is lessened by the fact that the local operating systems are designed
to support distributed systems. The local services provided by the various
machines in the system should appear similar to the DOS. Guest systems,
however, must take the various environments presented by the host operating
systems and transform them into the single environment presented by the FDPS.
One utility that is required here is a common command language that can be
used by the distributed user to interact with the system. Using this command
language, the FDPS user should be able to work with the system in a uniform
manner regardless of the location in the network of the services required
[GRAY79]. Because of the nature of an FDPS, the command language required
must support services such as creation of processes and the ability to specify
interconnections of processes. If the existing command language of a host
machine does not support these functions, then the implementation of a
distributed command language may not be a simple translation from the FDPS
command language to the host command language.

In addition, there may be other differences between the host systems
which the DOS must hide. The DOS is therefore responsible for hiding
differences in representation of data and inconsistencies in services provided
by the various host systems, in addition to the services normally provided by
the DOS. The major problem is the naming of services and resources. Each of
the local operating systems provides its own local name space or name spaces,
each with its own conventions. The DOS makes services available to the user by
logical names which reveal nothing of the service's location. Distributed
processes must also be able to locate services and other processes without
regard to the service's or process's location. This requires the DOS to
provide a global name space. The global name space must be able to handle
cases such as generically named services and replicated files.

One approach to this problem is to have the DOS provide a directory
service ([FORS80] and [PEEB80]). This service will perform translations from
FDPS names to local names. The host then would receive requests using these
local names. This approach allows other information to be included with
names, such as access information.
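
A directory service of this kind might be sketched as below. The class, its
methods, the host names, and the local file names are all hypothetical and
serve only to show a logical name being translated to a host-specific local
name, with access information stored alongside it; the sketch does not
describe the services of [FORS80] or [PEEB80].

```python
class DirectoryService:
    """Maps FDPS logical names to (host, local name) pairs plus access info."""

    def __init__(self):
        self._entries = {}

    def register(self, logical, host, local, allowed_users):
        """Record one copy of a service. Replicated or generically named
        services register several copies under the same logical name."""
        self._entries.setdefault(logical, []).append(
            {"host": host, "local": local, "allowed": set(allowed_users)})

    def resolve(self, logical, user):
        """Return (host, local name) of the first copy `user` may access."""
        for entry in self._entries.get(logical, []):
            if user in entry["allowed"]:
                return entry["host"], entry["local"]
        raise LookupError(f"no accessible copy of {logical!r} for {user!r}")


directory = DirectoryService()
# hostA is a development machine with restricted access (assumed example).
directory.register("pascal-compiler", "hostA", "[TOOLS]PASCAL.EXE", {"dev"})
directory.register("pascal-compiler", "hostB", "/bin/pc", {"dev", "alice"})

print(directory.resolve("pascal-compiler", "alice"))  # ('hostB', '/bin/pc')
print(directory.resolve("pascal-compiler", "dev"))
# ('hostA', '[TOOLS]PASCAL.EXE')
```

The user asks only for "pascal-compiler"; which host supplies it, and under
what local name, is decided by the directory.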

Providing communication among distributed processes is another task
which is complicated by heterogeneous host systems. This service must provide
conversion of data between two hosts if necessary. This may include such
simple functions as translating from one character code to another. It may
also involve more complex issues. For example, in the Desperanto system,
software is represented as modules, which in turn are composed of data objects
and functions which operate on them. The data objects may be represented in
different ways on the various hosts in the system, but the distributed process
should not be aware of this. A solution presented in [MAMR82b] is to provide
an intermediate representation for data objects and have the DOS perform the
conversion from the local representations to the intermediate representation
and vice versa.
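
The following sketch illustrates conversion through an intermediate
representation. The two host formats (byte order and character code) and the
object layout (a 32-bit count followed by a text label) are assumptions made
for the example and are not taken from [MAMR82b].

```python
import struct

# Hypothetical local formats: how each host stores a (count, label) object.
HOSTS = {
    "hostA": {"int": "<i", "text": "ascii"},  # little-endian, ASCII
    "hostB": {"int": ">i", "text": "cp037"},  # big-endian, EBCDIC
}


def to_intermediate(host, raw):
    """Convert a host's local bytes into the intermediate form (a tuple)."""
    fmt = HOSTS[host]
    (count,) = struct.unpack(fmt["int"], raw[:4])
    return count, raw[4:].decode(fmt["text"])


def from_intermediate(host, obj):
    """Convert the intermediate form into a host's local bytes."""
    fmt = HOSTS[host]
    count, label = obj
    return struct.pack(fmt["int"], count) + label.encode(fmt["text"])


def transfer(src, dst, raw):
    """Move an object between hosts via the intermediate representation."""
    return from_intermediate(dst, to_intermediate(src, raw))


on_a = from_intermediate("hostA", (258, "FDPS"))
on_b = transfer("hostA", "hostB", on_a)
print(on_a != on_b)                    # True: the local bytes differ
print(to_intermediate("hostB", on_b))  # (258, 'FDPS')
```

With an intermediate form, each of n host types needs only two converters
(to and from the intermediate form), 2n in total, rather than a converter for
every ordered pair of host types, n(n-1).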

4.6.4 Proposed Research

Research into this capability will form the basis of one of the two
proposed testbeds. We propose a project to construct a guest operating system
to run on at least two machines. In conjunction with this project, the
separate areas of data management and file management, interprocess
communication, and command languages will be addressed. Once complete, the
testbed will support research into these areas, as well as resource management
and load management.

4.6.5 Relationship to Other FDPS Work

The design of an FDPS using the guest system approach may provide not
only information as to the feasibility of this approach, but also
insights into the implementation issues of the base-level approach.
Consideration should be given to the level of support of FDPS services which
may be expected from existing operating systems. Indeed, the criteria which
make one operating system more suitable as a host system than another should
be explored. And since operating systems may provide more support in one
class of services than another, identification of the relative importance of
each class to the FDPS is important. Information in these areas may allow the
selection of host systems which are more suitable to the FDPS [FORS80]. Also,
clarification of these criteria may provide insight into the design of new
host operating systems for support of FDPS.

4.6.5.1 Distributed Software Tools - DSWT

Work of specific interest to this topic is the "DSWT Project." DSWT
consists of one or more software tools subsystems (SWT) which communicate to
locate and utilize resources and make decisions. DSWT takes the "meta
approach" to the design and implementation of a network operating system.
DSWT will give us a distributed environment for the network of PRIMES in the
ICS computer lab. The DSWT project will be extended to a heterogeneous
environment where other nodes will have implemented the entire set of tools or
perhaps only a subset.

4.6.5.2 Distributed Compiling Shells

The Shell in an operating system is the Command Interpreter, the
component of the operating system which parses the user's command line,
instantiates the appropriate processes, sets up communication between
them, monitors their execution, takes appropriate steps when errors occur, and
"cleans up" when the processes terminate. In most systems, shells are
interpretive; that is, they parse one user command, instantiate the correct
processes to carry it out, and, when those processes have terminated, return
to the command interpreter in order to carry out the next user command. In a
Fully Distributed System, this is inappropriate, since the intent is to take
advantage of the inherent parallelism of the system by executing user jobs as
concurrent systems of processes, executing in parallel. Therefore a new style
of shell must be developed which takes in an entire user's command file,
consisting of several command lines; parses the entire command file, from
which a task graph, as described above, can be built; and distributes the
results of this parsing step, as subgraphs, to the Local Operating Systems
which have to carry out each subgraph derived from the central task graph.
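
The compile-then-distribute idea can be sketched as follows. The command file
syntax used here (one pipeline per line, stages separated by '|', each stage
written as command@host) is purely an assumption for illustration, as are the
function names; how hosts are actually chosen is a work distribution question
outside this sketch.

```python
def compile_command_file(text):
    """Parse a whole command file into a task graph (nodes, edges).

    Assumed syntax: one pipeline per line, stages separated by '|',
    each stage written as command@host.
    """
    nodes, edges = {}, []
    for line_no, line in enumerate(text.strip().splitlines()):
        prev = None
        for pos, stage in enumerate(line.split("|")):
            command, host = stage.strip().split("@")
            node = f"{line_no}.{pos}"
            nodes[node] = {"command": command, "host": host}
            if prev is not None:
                edges.append((prev, node))  # prev's output feeds node
            prev = node
    return nodes, edges


def split_by_host(nodes, edges):
    """Give each host the processes it runs and every link touching them."""
    subgraphs = {}
    for node, info in nodes.items():
        sub = subgraphs.setdefault(info["host"], {"nodes": [], "links": []})
        sub["nodes"].append(node)
    for src, dst in edges:
        for host in {nodes[src]["host"], nodes[dst]["host"]}:
            subgraphs[host]["links"].append((src, dst))
    return subgraphs


script = """
sort@hostA | uniq@hostA | print@hostB
grep@hostB | print@hostB
"""
nodes, edges = compile_command_file(script)
subgraphs = split_by_host(nodes, edges)
print(subgraphs["hostA"]["nodes"])  # ['0.0', '0.1']
print(subgraphs["hostB"]["links"])  # [('0.1', '0.2'), ('1.0', '1.1')]
```

Because the whole file is parsed before anything runs, the two pipelines,
which share no edges, can be started concurrently, and the link between
hostA and hostB is known to both hosts in advance.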

4.6.6 Resources and Schedule

To cover a 24 month period:

    Manpower                                 man-months
      Senior Staff (2 m-m/year)                   4
      Junior Staff (6 m-m/year)                  12
      Programmers (3 at 5 m-m/year)              30
      Secretarial Support (3 m-m/year)            6

    Equipment
      Computer Time                          Substantial on heterogeneous
                                             environment

    Timing
      First period of 18 months:
        Research into capabilities required.
        Construction.
      Last period of 6 months:
        Experimental use of testbed.
        Evaluation.

4.7 THE NETWORK ARCHITECTURE - SYSTEM PROTOCOLS AND INTERFACES

4.7.1 Description

The "network architecture" is the master plan defining the rules
governing the overall structure of the distributed system at all levels of
interaction. The network architecture defines how resources will be provided
and utilized in a distributed processing environment. The network architecture
consists of the complete definition of the following items:

•  Standard interlayer interfaces
••  Interface data units
••  Service data units
••  Interface control information
••  Services provided across interface
••  Service request formats (procedure calls)

•  Standard peer protocols
••  Protocol data units
••  Protocol control information
••  Communication standards
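
The relationship among several of the items listed above can be sketched as
follows. The sketch assumes the usual layered-model usage of these terms (an
interface data unit carries interface control information plus a service data
unit; a protocol data unit carries protocol control information plus the same
service data unit); the class and function names are hypothetical and are not
definitions from this report.

```python
from dataclasses import dataclass


@dataclass
class InterfaceDataUnit:
    ici: dict   # interface control information, consumed at the interface
    sdu: bytes  # service data unit, passed through unchanged


@dataclass
class ProtocolDataUnit:
    pci: bytes  # protocol control information (header) for the peer layer
    sdu: bytes  # service data unit carried on behalf of the layer above


def send_down(idu, header):
    """A layer accepts an IDU from above and builds the PDU for its peer."""
    return ProtocolDataUnit(pci=header, sdu=idu.sdu)


def receive_up(pdu):
    """The peer layer strips the PCI and delivers the original SDU."""
    return pdu.sdu


idu = InterfaceDataUnit(ici={"priority": 1}, sdu=b"hello")
pdu = send_down(idu, header=b"\x01")
print(receive_up(pdu))  # b'hello'
```

The interface control information never leaves the sending host, while the
protocol control information is seen only by the peer layer; keeping the two
apart is what allows interfaces and protocols to be standardized separately.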

4.7.2 Background

Experience has shown that the definition and enforcement of a completely
defined network architecture is essential to the development of "good"
distributed systems. Initial efforts in these areas have usually followed the
path of ad hoc design, and the results clearly reflect this approach.

The  need  becomes  especially  apparent  when  there  is  a  requirement  to 
extend  or  expand  the  system. 

4.7.3 Problems

There  are  a  number  of  problems  in  this  area  that  must  be  addressed. 

•  It is impossible to completely define the network architecture
prior to system implementation because of the inability to
identify a priori all of the system features that must be
defined.

•  It is difficult to verify compliance with either interface or
protocol standards.

•  It is difficult to verify completeness and accuracy of interface
and protocol definitions.

•  A major problem is educating the designers and implementers as
to the pervasiveness of the network architecture definition.

The de facto definition of the architecture by "unrelated" and
"uncontrolled" "low-level" decisions must be prevented.

4.7.4 Proposed Solution

There are no good models or examples of solutions to this problem, nor
even poor ones.

The recommendation is that the basic features of the network architecture
be defined as early as possible and that a "network architecture
administrator" be established to continually monitor the development of the
network architecture definition.

4.7.5 Relationship to Other FDPS Work

Work in this area should be started at the earliest point possible and
continue in parallel with all stages of the development of a distributed
system.

4.7.6 Resources and Schedule

To cover a 24 month period:

    Manpower                                 man-months
      Senior Staff (2 m-m/year)                   4
      Junior Staff (3 m-m/year)                   6
      Secretarial Support (4 m-m/year)            8

4.8 OPERATIONAL SUPPORT CONCLUSION

We have discussed six capabilities for the operational support of
distributed systems. While these capabilities can be studied independently,
we propose that two testbeds be created as vehicles for research. The design
and construction processes will provide a basis for research into capabilities
such as file systems and data management, interprocess communication, and
command languages, while the completed testbeds can be used for
experimentation into capabilities such as interprocess communication, command
languages, resource allocation, and load management. Note that these research
issues overlap: some capabilities will benefit from both the construction
process and the completed systems. The construction process will also provide
excellent opportunity for study of the process and techniques for constructing
distributed operating systems.

The two testbeds differ fundamentally in their approaches. The first
takes the 'guest' approach, basing the distributed operating system
implementation on application programs run by existing host operating systems.
This approach is essentially based on expediency, and suffers inherent
limitations in its capabilities. Nonetheless, guest systems have promise,
because they emphasize heterogeneity and can couple considerably different
equipment. We propose that we implement a testbed simultaneously on a Prime
550 and a VAX 11/780. Funds for computer usage are included in the estimates.

The second testbed is to be a 'native', or resident, distributed
operating system. Mindful of the current economic climate, but wishing
nonetheless to do the experiment with more than trivial machinery, we have
allowed funds for five (5) Perq workstations from Three Rivers Computer
Corporation, to serve as a base for construction. These machines were designed
for this environment, and have many advantages which make them eminently
suitable; principally, the ability to redefine the machine architecture
through microcode.

Both testbeds, during and after the design and construction phases, will
facilitate the study of most of the capabilities described in this report.
This feature makes the testbed concept extremely profitable.

4.8.1 Existing Research at Georgia Tech

The testbed construction process corresponds to two projects currently
underway at Georgia Tech. The first, the 'guest' system, corresponds to the
Distributed Software Tools project, under Professor R. J. LeBlanc. The
second approach, the 'native' system, corresponds to the Clouds project, under
Professor M. S. McKendry (see Appendices H, I, J). Currently, neither project
receives targeted funding, a factor that constrains the rate of progress.

The  Clouds  project  currently  uses  two  Perq  workstations.  We  request 
five  (5)  additional  stations,  to  aid  during  development  for  implementation, 
and  after  development  for  experimentation. 

4.8.2 'Guest' System Resources

To cover a 30 month period:

Manpower (man-months)

    Senior Staff            (4 m-m/year)
    Junior Staff            (9 m-m/year)
    Programmers             (5 at 5 m-m/year)
    Secretarial Support     (6 m-m/year)

Equipment

    Computer Time           Substantial on heterogeneous testbed

Timing

    First period of 18 months:
        Research into capabilities required.
        Construction of testbed.

    Last period of 18 months:
        Experimental design and use of testbed for experimentation.

4.8.3 'Native' System Resources

To cover a 36 month period:

Manpower (man-months)

    Senior Staff            (4 m-m/year)
    Junior Staff            (9 m-m/year)
    Programmers             (6 at 5 m-m/year)
    Secretarial Support     (6 m-m/year)

Equipment

    Computer Time           Substantial on testbed of 5-8 workstations
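
As a rough check on the manpower figures above, the yearly rates can be
converted into totals for the 36 month period. The following Python sketch
assumes that each rate applies uniformly over the whole period and that
"(6 at 5 m-m/year)" means six programmers each contributing five man-months
per year. The names used are illustrative and do not come from the proposal.

```python
# Yearly manpower rates for the 'native' testbed, stored as
# (headcount, man-months per person per year).
# Assumption: every rate applies uniformly across the full 36 month period.
PERIOD_MONTHS = 36

NATIVE_RATES = {
    "Senior Staff": (1, 4),
    "Junior Staff": (1, 9),
    "Programmers": (6, 5),
    "Secretarial Support": (1, 6),
}


def total_man_months(headcount, rate_per_year, period_months=PERIOD_MONTHS):
    """Scale a yearly man-month rate to the whole period."""
    return headcount * rate_per_year * period_months / 12


def manpower_totals(rates, period_months=PERIOD_MONTHS):
    """Return the total man-months for each staff category."""
    return {
        category: total_man_months(headcount, rate, period_months)
        for category, (headcount, rate) in rates.items()
    }


if __name__ == "__main__":
    totals = manpower_totals(NATIVE_RATES)
    for category, months in totals.items():
        print(f"{category:<20} {months:6.1f} man-months")
    print(f"{'Total':<20} {sum(totals.values()):6.1f} man-months")
```

Under these assumptions the programmers account for 90 of the 147 man-months
in total.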

Timing

    First period of 24 months:
        Design 'global' operating system;
        study support capabilities.

    Last period of 18 months:
        Experimental design;
        execute experiments in 'dynamic' capabilities.


SECTION 5
SUMMARY

5.1  jam  jflLS  m  msuismi  ss.  samm  ouhNUias 

It is almost an unnecessary repetition of a truism to state that
developers need solid and immediate user feedback on the functionality
provided, the design, and the utilization of any support capability while it
is being developed. However, achieving this goal is at best moderately
difficult; more often it is almost impossible.

As Gary Nutt said in [Riddle], "The user interface to software development
tools is sometimes as important as their integrated functionality." Our
experience at Georgia Tech indicates that an even stronger statement is more
appropriate. The user interface is usually the single most important factor
governing the overall value of any software development tool or design
facility.

If the support capability in question is a simple, self-contained unit
with reasonably well-defined input and output, such as a text editor, user
feedback can be obtained as increments of the support capability are
developed. In the example of the editor, additional features can be added,
command syntax can be changed, and output/display formats can be changed
incrementally with relative ease. Also, prototypes of the intermediate
products can be released to users for actual use and evaluation to provide
guidance in the development and refinement of later versions.

On the other hand, if the support capability is a large and complex
facility such as a simulator or data base design analyzer, it is very
difficult to obtain user feedback at "intermediate stages" of development.
There is usually nothing to utilize, even on a trial basis, until the support
facility has been completely implemented. This point is extremely important
and applicable to many of the support facilities covered in this report, since
users will have had little experience with similar tools on which to base
intermediate judgments.

Comments similar to those given in the paragraph above also apply to
obtaining user feedback on the operational support capabilities. In this
case the problem is even worse, for now the complete operational system must be
implemented, at least in prototype form, to allow user evaluation.

Providing user input in the form of detailed performance and operational
requirements specifications becomes increasingly important for the three major
classes of support capabilities discussed here, i.e., software development
tools, design support facilities, and operational support capabilities; but
at the same time it becomes increasingly difficult to define the specifics of
that user input.

5.2 INTEGRATION OF SUPPORT CAPABILITIES

There are several groupings or collections of support capabilities
that are closely related to one another either in function or with respect to
input and output. The desirable goal for organizing these related "tools" and
establishing their interrelationships has been referred to as "integration."
However, "integration" has been used to describe at least three different
levels or methods of organization (Tom Love in "Discussion" in [Riddle]):

• The tools reside on the same system (a "toolbox")

• The output of one tool is valid input to another (a "workbench")

• Each tool has knowledge of what other tools may have done or be
  capable of doing (a "capable assistant")

Our experience at Georgia Tech has shown that the features and
capabilities of at least the second level are essential. It is extremely
convenient for the output of one tool process to be directly acceptable as input
to another. Two of the major obstacles to user acceptance of individual
software development tools have been

• Peculiarities (i.e., incompatible differences) in the formats
  of their input and output, and

• Peculiarities (i.e., non-uniformity) in the syntax and semantics
  of their command languages and other aspects of the user interfaces
  required to utilize the "tool".

An important design goal is to avoid both of these forms of
"peculiarities." In addition, if all tools utilize a single, common format
for both input and output, the usability of the various tools is greatly
augmented by flexibility in the interconnection of various tools.
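
The "workbench" level of integration described above can be illustrated with
a minimal Python sketch. It assumes a hypothetical common format (a list of
records, one per line of text) and two made-up tools; neither the format nor
the tool names come from the Georgia Tech Software Tools Subsystem. Because
every tool reads and writes the same format, tools can be connected in any
order.

```python
from typing import Callable, Dict, List

# Hypothetical common format: each tool consumes and produces a list of
# records, one dict per line of text. The report does not prescribe a format.
Record = Dict[str, object]
Tool = Callable[[List[Record]], List[Record]]


def read_text(text: str) -> List[Record]:
    """Convert raw text into the common record format."""
    return [{"line": n, "text": t} for n, t in enumerate(text.splitlines(), start=1)]


def drop_blank(records: List[Record]) -> List[Record]:
    """Tool 1: discard records whose text is empty or whitespace."""
    return [r for r in records if str(r["text"]).strip()]


def count_words(records: List[Record]) -> List[Record]:
    """Tool 2: annotate each record with a word count."""
    return [{**r, "words": len(str(r["text"]).split())} for r in records]


def connect(*tools: Tool) -> Tool:
    """Chain tools so that each one's output becomes the next one's input."""
    def pipeline(records: List[Record]) -> List[Record]:
        for tool in tools:
            records = tool(records)
        return records
    return pipeline


if __name__ == "__main__":
    source = "a b\n\nc"
    for record in connect(drop_blank, count_words)(read_text(source)):
        print(record)
```

Swapping the order of the two tools in `connect` yields the same result here,
which is the kind of flexibility a shared format is meant to provide.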

In addition to the ability to freely interconnect software development
tools by transferring "output products" to other "inputs", there should
be a hierarchy of support capabilities that provides a transfer of information
between the various tools and other capabilities. Achieving this is certainly
going to be more difficult than providing simple interconnectivity.

5.3 IMPORTANCE OF PRODUCTIVITY AS A GOAL

Perhaps more fundamental than any other aspect when considering the
evaluation of system support capabilities is the cost vs. increased productivity
tradeoff. Whether viewing single processor systems, multiprocessors,
or fully distributed systems, it is the cost of the time, labor, etc. in creating
a tool (whether analysis based, design or implementation based, or other) and
its expected payoff (in programmer productivity, machine throughput,
communication costs, or other) which determine whether the support capability is
worth creating.
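
The tradeoff can be made concrete with a small break-even calculation. The
Python sketch below is illustrative only: it assumes that both the cost of
building a tool and its payoff can be expressed in the same unit (for
instance man-months) and that the payoff accrues at a constant monthly rate.
The function names and the figures are hypothetical.

```python
from typing import Optional


def net_payoff(build_cost: float, payoff_per_month: float, months_in_use: float) -> float:
    """Net gain from building a tool and then using it for a given time."""
    return payoff_per_month * months_in_use - build_cost


def break_even_months(build_cost: float, payoff_per_month: float) -> Optional[float]:
    """Months of use needed to recover the build cost, or None if it never pays off."""
    if payoff_per_month <= 0:
        return None
    return build_cost / payoff_per_month


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    cost, saving = 12.0, 3.0
    print(break_even_months(cost, saving))
    print(net_payoff(cost, saving, months_in_use=10))
```

A tool is worth creating, in this simplified view, only if its expected
period of use exceeds the break-even point.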


Examining the history of both programming languages and operating
systems shows that over time it has become desirable to raise the
"intelligence" level of these support tools. In both cases, we have delegated more
and more of the lower level details to the machine itself. For example, we
first had machine code, then assemblers, then compilers, then compiler
compilers, etc. This trend is presently moving faster than at any other time,
for the capacities of modern machines are rising so rapidly. We simply expect
more from computers now.


The usual support tools of most present-day commercial systems are very
primitive compared to what is dictated by the cost vs. utility tradeoff
discussed above. Some of this "poor" support could be changed rather easily;
other improvements do not as yet have enough productivity value to overcome
their costs. For example, even though bit map displays (raster scan) and
positioning devices are clearly faster to use, their current cost has prevented
widespread use. Whether in hardware or software, it is important to locate the
support items which will clearly maximize the overall gain.


Unfortunately, distributed systems design, implementation, and operation
are still very much research topics. It appears to us that accurately
predicting productivity payoffs for support capabilities implementing
new concepts in not yet defined environments is not yet feasible.

5.4 TRANSPORTABILITY OF SUPPORT CAPABILITIES

It is highly desirable to be able to transport all
of the distributed system support capabilities discussed here from one operating
environment to another. The types of obstacles inhibiting this are not
very much different from those encountered with centralized systems for the
software development support tools; however, the problems are quite different
as to both scope and magnitude with respect to the design support and
operational support capabilities.

We at Georgia Tech have had a large amount of experience in constructing
an integrated set of software development tools (the Georgia Tech Software
Tools Subsystem) and "transporting" that subsystem from one environment to
another. The problems of incompatible language implementation and features
can be overcome fairly easily, especially utilizing the editors available in
the tool set. The major problems encountered have to do with transporting the
process control involved and defining a suitable standard format that can be
utilized for both the output and input of each individual tool.

Some of the design support capabilities, such as simulators, can be
transported with a reasonable amount of work. However, those capabilities
that interact directly with the target system, such as monitors, are probably
not transportable at all. Of course, the "concepts" are transportable; it is
just the implementation that is probably too specialized for use elsewhere.
Support capabilities such as the designer workbenches can probably be
transported with a reasonable amount of effort to run on a different processor.
However, in this case the characteristics of the target environment may
be deeply embedded in the details of operation of the workbench processes.

Operational support capabilities are, by their very nature and purpose,
highly oriented to a specific target environment. Again, it is probably only
the concepts that are easily transferred.

5.5 EVALUATION OF SUPPORT CAPABILITIES

The evaluation of support capabilities for distributed systems is not
much different from the evaluation of similar support for centralized systems.
The target environments certainly have major differences, but evaluation of
capabilities such as these is most often heavily influenced by the "user side"
rather than the "output side."

Evaluation  most  often  relies  primarily  on  subjective  ratings  of  factors 
such  as 

• Learnability

• Utility

• Functionality provided/supported

• Reliability of tool operation and product

• Performance of tool and its product

• Integration of various tools

• User acceptability (for instance, compared to "Not Invented Here"
  problems)

5.6 ... OF OPERATIONAL SUPPORT CAPABILITIES


Research on distributed systems at Georgia Tech (as well as
elsewhere) has indicated that it is difficult to proceed past even the most
rudimentary research without the experience of designing, building, and
operating a distributed system. While we have been able to do some
experimentation on our initial FDPS testbed consisting of five Prime
computers, we have concluded that a hardware/software testbed designed
exclusively for experimentation is required.


The Clouds project is undertaking the design and construction of this
testbed. Clouds is being constructed for a group of Perq workstations
connected by a 10 Mbps Ethernet. Since the Ethernet also links other equipment in
our computer laboratory, in particular the Prime computers, the new testbed
will be fully integrated with existing facilities.


The Clouds project will proceed in three phases. During Phase 1, design
and initial construction, operational support capabilities will be studied and
developed. Once the initial testbed is functional, Phase 2 will entail
evaluation and refinement of operational support capabilities, and development
of software support capabilities. Finally, Phase 3 will involve the
exploitation of the testbed. This will involve study of all support capabilities, and
will also involve experimental research into real time control systems,
personal computing environments, office automation systems, and distributed
databases, which are all applications of the testbed.

A Clouds status report is included as Appendix H of this document.

5.7 THE NETWORK ARCHITECTURE

The requirement for a network architecture that fully defines standard
protocols and interfaces was discussed in Section 4 as a specific operational
support capability. The critical importance of the network architecture to
the overall success of any distributed processing system indicates that it
should be given much more attention than just consideration as "another"
operational support capability.

The development of our ability to organize the specification of the
design of a distributed system is probably the most important advance made in
this area since the inception of the concept of distributed processing
systems. At least one of the authors of this report has been involved with
the design, implementation, and operation of a number of distributed processing
systems, and he feels that the development of a good network architecture and
its control through the development cycle is the most important contribution
to overall success.

5.8 PRIORITY OF SPECIFIC SUPPORT CAPABILITIES

There are a number of criteria that one might utilize in developing a
ranking of importance of the various individual support capabilities described
in this report. Some of these criteria are:

• Difficulty of research problems to be addressed and solved.

• Length of time required for complete development of the support
  capability.

• "Position" of the specific support capability on the "critical
  path" of the overall system development schedule.

• Anticipated "value" of the support capability in improving
  system performance; consider factors such as

    •• Response time
    •• Reliability
    •• Robustness/Fault-Tolerance
    •• Resource utilization
    •• Effectiveness of system control
    •• etc.

The major problem facing any attempt to prioritize the support
capabilities is that almost none of the criteria listed above produce the same
answers; in fact, some of the criteria are internally inconsistent in the
ordering they suggest. Further, several of the criteria are directly
contradictory.
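
The way different criteria pull a ranking in different directions can be
illustrated with a short Python sketch. The capability names are abbreviated
and the scores are invented purely for illustration (higher means "rank
earlier" under that criterion); none of the numbers come from this report.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical scores per criterion; higher means "rank earlier".
SCORES: Dict[str, Dict[str, int]] = {
    "research difficulty": {"testbed": 2, "file system": 3, "load manager": 1},
    "development time": {"testbed": 3, "file system": 1, "load manager": 2},
    "critical path": {"testbed": 3, "file system": 2, "load manager": 1},
}


def ranking(scores: Dict[str, int]) -> List[str]:
    """Order capabilities from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def disagreements(a: List[str], b: List[str]) -> int:
    """Count the pairs of capabilities that rankings a and b order differently."""
    pos_a = {name: i for i, name in enumerate(a)}
    pos_b = {name: i for i, name in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


if __name__ == "__main__":
    rankings = {criterion: ranking(s) for criterion, s in SCORES.items()}
    for first, second in combinations(rankings, 2):
        count = disagreements(rankings[first], rankings[second])
        print(f"{first} vs {second}: {count} pair(s) ordered differently")
```

With these made-up scores no two criteria agree completely, which is the
situation described above and the reason a single criterion has to be chosen.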

5.8.1 Criteria Utilized in This Report

Since one of the original goals of this project was the actual
implementation of the "highest priority" support capabilities, the criterion
utilized here to order them has been their position on the critical path:
just how essential is the capability in an actual implementation? In the
lists given below, those capabilities designated "highest priority" represent
the minimum subset essential to implement a basic version of a loosely-coupled
distributed processing system.

5.8.2 Priority List

• Highest Priority: Essential/First
    •• Standard Architecture, Protocols, and Interfaces
    •• Distributed Systems Testbed
    •• Distributed File and Data Management System
    •• Language Support for Robust Distributed Programs
    •• Compilation Techniques for Distributed Programs
    •• Distributed Resource Management
        ••• Distributed Resource Access
        ••• Distributed Resource Allocation
        ••• Distributed Process Execution
    •• Distributed Load Manager
    •• Distributed Command Language (initial capabilities only)
    •• Distributed Interprocess Communication
    •• Distributed Execution Monitor (IPC monitoring as a minimum)

• Lower Priority: "Highly Useful"
    •• Distributed Command Language (full capabilities)
    •• Distributed Design Language
    •• All remaining System Design Support Facilities

• Lowest Priority: "Also Useful"
    •• Distributed Compilers
    •• Compiler Development Tools for Heterogeneous Systems
    •• Software Version Management
    •• Cost Estimation and Control
    •• Guest System Testbed

5.9 REFERENCES

[Riddle80] Riddle, W.E. and R.E. Fairley, Software Development Tools,
Springer-Verlag, Berlin, 1980, 277+viii pp. (Proceedings of a
workshop on software development tools held at Pingree Park, Colorado,
May, 1979. Workshop emphasized pre-implementation phases of software
development.) Of particular interest:




